Crash in OOM | small with BuildDisplayListForChild

Status: NEW
Type: defect
Priority: P3
Severity: critical
Opened: 2 years ago
Updated: 3 months ago

People: Reporter: philipp, Assigned: mattwoodrow, NeedInfo
Keywords: crash, leave-open, regression
Version: 58 Branch
Platform: All, Windows

Firefox Tracking Flags: firefox-esr52 unaffected, firefox-esr60 affected, firefox57 unaffected, firefox58 wontfix, firefox59 wontfix, firefox60 wontfix, firefox61 wontfix, firefox62 fix-optional

Attachments: 1 attachment

Reporter

Description

2 years ago
This bug was filed from the Socorro interface and is
report bp-bc93d3d2-ea43-41d7-8bcf-6c7ad0171128.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll NS_ABORT_OOM xpcom/base/nsDebugImpl.cpp:620
1 xul.dll nsPresArena::Allocate layout/base/nsPresArena.cpp:148
2 xul.dll nsDisplayListBuilder::CreateClipChainIntersection layout/painting/nsDisplayList.cpp:1614
3 xul.dll nsDisplayListBuilder::CopyWholeChain layout/painting/nsDisplayList.cpp:1622
4 xul.dll nsDisplayListBuilder::MarkOutOfFlowFrameForDisplay layout/painting/nsDisplayList.cpp:1200
5 xul.dll nsDisplayListBuilder::MarkFramesForDisplayList layout/painting/nsDisplayList.cpp:1477
6 xul.dll nsIFrame::MarkAbsoluteFramesForDisplayList layout/generic/nsFrame.cpp:3767
7 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:3664
8 xul.dll nsFlexContainerFrame::BuildDisplayList layout/generic/nsFlexContainerFrame.cpp:2267
9 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:3717

=============================================================

out of memory crashes on windows with BuildDisplayListForChild in their proto signature are rising in the 58 cycle. 

on 58.0b there are around 150 more daily reports of this than before:
https://crash-stats.mozilla.com/signature/?product=Firefox&submitted_from_infobar=%21__true__&proto_signature=~BuildDisplayListForChild&release_channel=beta&signature=OOM%20%7C%20small&date=%3E%3D2017-08-01#graphs

during the 58 nightly cycle these crashes started spiking up around 2017-10-30:
https://crash-stats.mozilla.com/signature/?product=Firefox&submitted_from_infobar=%21__true__&proto_signature=~BuildDisplayListForChild&release_channel=nightly&signature=OOM%20%7C%20small&date=%3E%3D2017-08-01#graphs
Jet, can you find someone to take a look?
Flags: needinfo?(bugs)
(wondering if we started experiments with RDL on beta yet...)
Bug 1411881 landed around the time this spiked on Nightly. Maybe related?

Otherwise, below is a very rough pushlog range for around the time this regressed. A few other display list changes in there too.
https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2017-10-29&enddate=2017-10-31
Component: Layout → Layout: Web Painting
Flags: needinfo?(bugs) → needinfo?(matt.woodrow)
Marking this as blocking 58 because the crash volume is very high.
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #2)
> (wondering if we started experiments with RDL on beta yet...)

layout.display-list.retain is still false in 59.0b8.
Matt, could this correlate with refactoring in the surrounding code that isn't guarded by the pref?
Matt, any ideas?
Flags: needinfo?(matt.woodrow)
(oops, accidentally cleared ni?)
Flags: needinfo?(matt.woodrow)
Assignee

Comment 8

2 years ago
I'm working on trying to narrow this down.

The crash reports themselves don't show a lot. Most look like normal OOMs, though some seem to be OOM crashes with fairly low memory usage.

It's possible that there are multiple issues causing this.

Do we know if any other OOM signatures dropped around this time?
Assignee: nobody → matt.woodrow
Reporter

Comment 9

2 years ago
on the beta channel the generic [@ OOM|small] signature seems to be rising in 58:
https://crash-stats.mozilla.com/signature/?submitted_from_infobar=%21__true__&release_channel=beta&product=Firefox&signature=OOM%20%7C%20small&date=%3E%3D2017-06-08T12%3A14%3A13.000Z&date=%3C2017-12-08T11%3A14%3A13.000Z#graphs (when looking into that i stumbled upon the crashes described in this bug report)

a signature related to memory pressure that has dropped in 58 is [@ EMPTY: no crashing thread identified; ERROR_NO_MINIDUMP_HEADER] - but since those reports don't contain much information it's hard to attribute the decline to something in particular.
Assignee

Comment 10

2 years ago
Ok, that does sound like a real memory usage increase.

Looking at the crash stats, it looks like this happened no more than once per Nightly until the 1022 and 1023 builds where we had two crashes, 4 in the 1027 build, 6 and 11 in the two 1029 builds, and then 14 in the 1030 build (taken from build id aggregations).

It's really hard to know exactly when it started because of the amount of variance there. The 4 crashes in the 1027 build suggest it likely already had the bug, but we went back to 1 crash in 1028.

Do we have any way of measuring uptake/usage of each Nightly build? I assume Nightlies from certain days of the week get more usage than others, but I don't know how to quantify that and apply it to these results.
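
If per-build usage figures were available, one simple way to apply them would be to compare crash rates rather than raw counts. A minimal sketch, assuming a hypothetical usage-hours figure per build (nothing in this bug says where that number would come from, and the function name is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// Crashes per 1,000 usage hours for a single build.
// aUsageHours is a hypothetical input: the thread does not identify a source
// for per-build usage data.
inline double CrashesPerThousandHours(size_t aCrashes, double aUsageHours) {
  assert(aUsageHours >= 0.0);
  if (aUsageHours == 0.0) {
    return 0.0;  // no usage recorded: report no rate instead of dividing by zero
  }
  return 1000.0 * static_cast<double>(aCrashes) / aUsageHours;
}
```

Comparing this rate across builds would show whether a build with few crashes simply had few users that day.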

Wider regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2017-10-21&enddate=2017-10-30

If we assume the 1022 and 1023 results are significant, then the regression must have happened on 1021/1022, and there's nothing interesting (relating to display lists) there. The main retained-dl code landed on the 23rd.

Bug 1405146 landed on the 25th. That one seems like it could increase the total memory used during painting (probably not a huge amount, but it depends on the page). That doesn't fit the timing perfectly, but it's possible.


I've set up ASAN and DMD builds on my local Windows machine, but haven't been able to reproduce any leaking or corruption.
The daily crash volume is still substantial (over 100 reports per day). Did we see any memory regression in the 1021/1022/1023 builds?
Flags: needinfo?(matt.woodrow)
Assignee

Comment 13

2 years ago
I'm struggling to make progress with this. A lot of the reports really don't seem to be in a particularly low memory state at all, like this one: https://crash-stats.mozilla.com/report/index/3d6da33d-850b-49a0-8d15-afbda0180105

8.79TB left of virtual memory, 2.32GB left of physical memory, and 6.46GB left of the page file.

Crashing with an OOM with those numbers seems really suspicious to me. Still trying to figure out more.
Assignee

Comment 14

2 years ago
Nathan, do you have any ideas why we'd fail to allocate memory when there's still so much left?

The only alternative I can think of is that ArenaChunk::header::offset is 0 (which has been seen before, in bug 1406727 comment 36) which makes ArenaChunk::Allocate return nullptr despite not really being OOM.

That seems like it should be much too rare to cause this volume of crashes though.
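
A minimal sketch of that failure mode, assuming the chunk hands out its stored offset as the returned pointer. `SketchChunk` and `ZeroedOffsetLooksLikeOOM` are hypothetical names for illustration, not the real `mozilla::ArenaAllocator` code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for an arena chunk header.
struct SketchChunk {
  uintptr_t offset;  // address of the next free byte
  uintptr_t tail;    // address one past the end of the chunk

  size_t Available() const { return tail - offset; }

  // Bump allocation: hand out the current offset as the pointer.
  void* Allocate(size_t aSize) {
    if (aSize > Available()) {
      return nullptr;  // the chunk really is full
    }
    void* p = reinterpret_cast<void*>(offset);
    offset += aSize;
    return p;  // null if offset was clobbered to 0, even with space left
  }
};

// Returns true when a zeroed offset makes Allocate() report failure although
// the backing buffer still has room.
inline bool ZeroedOffsetLooksLikeOOM() {
  alignas(8) static char buffer[256];
  SketchChunk chunk{reinterpret_cast<uintptr_t>(buffer),
                    reinterpret_cast<uintptr_t>(buffer) + sizeof(buffer)};
  void* first = chunk.Allocate(16);
  assert(first == buffer);  // healthy chunk: allocation succeeds
  (void)first;
  chunk.offset = 0;  // simulate corruption of the header
  return chunk.Allocate(16) == nullptr;
}
```

A caller that treats a null result as out-of-memory would then abort with an OOM signature regardless of how much memory the system has left.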
Flags: needinfo?(matt.woodrow) → needinfo?(nfroyd)
See Also: → 1418806
Skimming through, some of them have a very small amount of contiguous free memory left, e.g.

https://crash-stats.mozilla.com/report/index/401dc51d-8b97-4ed7-bf34-473570180109
https://crash-stats.mozilla.com/report/index/98afc26e-5a76-4ec2-89ea-0bbe60180109
https://crash-stats.mozilla.com/report/index/1ba9339b-80e3-407a-9f61-42d8b0180109

all have a largest contiguous VM block of < 2MB.  If the allocation winds up requesting blocks of memory from the OS, and requests 2MB chunks when it does so (to carve larger blocks out of), you're going to be out of luck.  That's just life.
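
To illustrate why total free memory is not the relevant number here, a small sketch that models free address space as a list of region sizes. The types and function names are hypothetical, and the 2MB request size is taken from the "if" in the paragraph above rather than from any confirmed allocator behaviour:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct FreeSpace {
  size_t total;    // sum of all free regions
  size_t largest;  // largest single contiguous free region
};

// Summarize a list of free region sizes (in bytes).
inline FreeSpace Summarize(const std::vector<size_t>& aFreeRegions) {
  FreeSpace result{0, 0};
  for (size_t region : aFreeRegions) {
    result.total += region;
    result.largest = std::max(result.largest, region);
  }
  return result;
}

// A contiguous request succeeds only if one single region can hold it,
// no matter how large the total is.
inline bool CanMapContiguous(const std::vector<size_t>& aFreeRegions,
                             size_t aRequest) {
  assert(aRequest > 0);
  return Summarize(aFreeRegions).largest >= aRequest;
}
```

With free regions of 1.5MB, 1MB and 1.75MB, the total is over 4MB, yet a 2MB contiguous request still fails.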

But that doesn't explain the crash in comment 13, or one like:

https://crash-stats.mozilla.com/report/index/0020ff4d-f4af-4988-97f7-667cd0180109

which both have tons of space--total virtual/physical and large chunks of VM--unless the largest contiguous VM block measurements (see the "largest_free_vm_block" field in the Raw Dump tab) are completely out of whack.  But then that'd be some massive fragmentation, given that there's so much space left.

There are also things like:

https://crash-stats.mozilla.com/report/index/1c1fd1a7-203a-49cf-9f1a-a7e750180109

which has ~4MB of contiguous VM space left, but still OOMs.

ArenaChunk::header::offset being busted seems reasonable to me, but then I don't know what made it that way.  More canaries are in order? =/
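
A rough sketch of what an extra canary could look like: a known marker stored next to the header fields and verified on every allocation, on the assumption that whatever clobbers the offset also clobbers its neighbour. All names and the marker value are hypothetical and chosen only for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Arbitrary marker value chosen for this sketch.
static const uintptr_t kSketchCanary = 0x0f0b0f0b;

// Hypothetical chunk header with a canary stored alongside the offset.
struct GuardedHeader {
  uintptr_t offset;  // address of the next free byte
  uintptr_t tail;    // address one past the end of the chunk
  uintptr_t canary;  // must always equal kSketchCanary
};

inline GuardedHeader MakeHeader(void* aBuffer, size_t aSize) {
  assert(aBuffer != nullptr);
  uintptr_t start = reinterpret_cast<uintptr_t>(aBuffer);
  return GuardedHeader{start, start + aSize, kSketchCanary};
}

inline bool CanaryIntact(const GuardedHeader& aHeader) {
  return aHeader.canary == kSketchCanary;
}

// Verify the canary before trusting the header, so that corruption crashes
// here with its own signature instead of surfacing later as a bogus OOM.
inline void* CheckedAllocate(GuardedHeader& aHeader, size_t aSize) {
  if (!CanaryIntact(aHeader)) {
    std::abort();
  }
  if (aSize > aHeader.tail - aHeader.offset) {
    return nullptr;  // the chunk really is full
  }
  void* p = reinterpret_cast<void*>(aHeader.offset);
  aHeader.offset += aSize;
  return p;
}
```

Crash reports from such a check would separate "header was corrupted" from "allocation genuinely failed", which is the distinction the existing reports can't make.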
Flags: needinfo?(nfroyd)
Assignee

Comment 16

2 years ago
Worth a shot at least!
Attachment #8942598 - Flags: review?(nfroyd)
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

r+ to get this on to Beta and get crash reports back ASAP. 

Nathan: I'll leave a NI on you to have a look at the first reports that come back. Thx!
Flags: needinfo?(nfroyd)
Attachment #8942598 - Flags: review?(nfroyd) → review+
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

Approval Request Comment
[Feature/Bug causing the regression]:
See bug 1421345.

[User impact if declined]:
Undiagnosed OOM crashes.

[Is this code covered by automated tests?]:
Yes

[Has the fix been verified in Nightly?]:
This is not a fix. It's diagnostic code to help identify a root cause for OOM crashes when there's still available memory.

[Needs manual test from QE? If yes, steps to reproduce]: 
No.

[List of other uplifts needed for the feature/fix]:
No.

[Is the change risky?]:
Low risk

[Why is the change risky/not risky?]:
Diagnostic code that we'll pull out before we ship to Release.

[String changes made/needed]:
None.
Attachment #8942598 - Flags: approval-mozilla-beta?
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

For debugging purposes. Beta58+.
Attachment #8942598 - Flags: approval-mozilla-release+
Attachment #8942598 - Flags: approval-mozilla-beta?
Attachment #8942598 - Flags: approval-mozilla-beta+
Backed out for bustage at dist/include/mozilla/ArenaAllocator.h:180:7: 'canary' was not declared in this scope:

https://hg.mozilla.org/releases/mozilla-beta/rev/9579dad4492b9ce9e2be0379bae320e1f6327394
https://hg.mozilla.org/releases/mozilla-release/rev/fae7c41d40fd8ddb4d6d0ade34af7c75fef0e4d5

Push with bustage:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=814254bd1eb76533621eea0700d0182aa3121350&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Build log: https://treeherder.mozilla.org/logviewer.html#?job_id=156352910&repo=mozilla-release

[task 2018-01-15T11:49:56.529Z] 11:49:56     INFO -  gmake[5]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/base'
[task 2018-01-15T11:49:56.530Z] 11:49:56     INFO -  /usr/bin/ccache /builds/worker/workspace/build/src/gcc/bin/g++ -std=gnu++11 -o Unified_cpp_xpcom_base0.o -c -I/builds/worker/workspace/build/src/obj-firefox/dist/stl_wrappers -I/builds/worker/workspace/build/src/obj-firefox/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -DNDEBUG=1 -DTRIMMED=1 -DOS_POSIX=1 -DOS_LINUX=1 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/builds/worker/workspace/build/src/xpcom/base -I/builds/worker/workspace/build/src/obj-firefox/xpcom/base -I/builds/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/builds/worker/workspace/build/src/ipc/chromium/src -I/builds/worker/workspace/build/src/ipc/glue -I/builds/worker/workspace/build/src/xpcom/build -I/builds/worker/workspace/build/src/dom/base -I/builds/worker/workspace/build/src/xpcom/ds -I/builds/worker/workspace/build/src/obj-firefox/dist/include -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-firefox/mozilla-config.h -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wc++14-compat -Wduplicated-cond -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wformat -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -D_GLIBCXX_USE_CXX11_ABI=0 -pipe -g -O3 -fomit-frame-pointer -Werror -I/builds/worker/workspace/build/src/widget/gtk/compat-gtk3 -pthread -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gtk-3.0/unix-print -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gtk-3.0 
-I/builds/worker/workspace/build/src/gtk3/usr/local/include/gio-unix-2.0/ -I/builds/worker/workspace/build/src/gtk3/usr/local/include/cairo -I/builds/worker/workspace/build/src/gtk3/usr/local/include/pango-1.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/atk-1.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/cairo -I/builds/worker/workspace/build/src/gtk3/usr/local/include/pixman-1 -I/builds/worker/workspace/build/src/gtk3/usr/local/include -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gdk-pixbuf-2.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/glib-2.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/lib/glib-2.0/include -I/builds/worker/workspace/build/src/gtk3/usr/include/freetype2 -I/builds/worker/workspace/build/src/gtk3/usr/include/libpng12 -fprofile-generate -MD -MP -MF .deps/Unified_cpp_xpcom_base0.o.pp   /builds/worker/workspace/build/src/obj-firefox/xpcom/base/Unified_cpp_xpcom_base0.cpp
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -  In file included from /builds/worker/workspace/build/src/obj-firefox/dist/include/nsPresArena.h:13:0,
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -                   from /builds/worker/workspace/build/src/obj-firefox/dist/include/nsIPresShell.h:38,
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -                   from /builds/worker/workspace/build/src/obj-firefox/dist/include/nsPresContext.h:19,
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -                   from /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/Element.h:28,
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -                   from /builds/worker/workspace/build/src/dom/base/nsDOMMutationObserver.h:20,
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -                   from /builds/worker/workspace/build/src/xpcom/base/CycleCollectedJSContext.cpp:35,
[task 2018-01-15T11:49:56.531Z] 11:49:56     INFO -                   from /builds/worker/workspace/build/src/obj-firefox/xpcom/base/Unified_cpp_xpcom_base0.cpp:20:
[task 2018-01-15T11:49:56.532Z] 11:49:56     INFO -  /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/ArenaAllocator.h: In member function 'void* mozilla::ArenaAllocator<ArenaSize, Alignment>::ArenaChunk::Allocate(size_t)':
[task 2018-01-15T11:49:56.532Z] 11:49:56     INFO -  /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/ArenaAllocator.h:180:7: error: 'canary' was not declared in this scope
[task 2018-01-15T11:49:56.532Z] 11:49:56     INFO -         canary.Check();
[task 2018-01-15T11:49:56.532Z] 11:49:56     INFO -         ^~~~~~
[task 2018-01-15T11:49:56.534Z] 11:49:56     INFO -  /builds/worker/workspace/build/src/config/rules.mk:1028: recipe for target 'Unified_cpp_xpcom_base0.o' failed
[task 2018-01-15T11:49:56.534Z] 11:49:56     INFO -  gmake[5]: *** [Unified_cpp_xpcom_base0.o] Error 1
[task 2018-01-15T11:49:56.535Z] 11:49:56     INFO -  gmake[5]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/base'
[task 2018-01-15T11:49:56.535Z] 11:49:56     INFO -  /builds/worker/workspace/build/src/config/recurse.mk:73: recipe for target 'xpcom/base/target' failed
[task 2018-01-15T11:49:56.535Z] 11:49:56     INFO -  gmake[4]: *** [xpcom/base/target] Error 2
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bugs)
Looks like we'll also need to uplift at least the patch for bug 1406727 comment 46. I'll let Matt request the required uplifts for that one.
Flags: needinfo?(bugs)
Whiteboard: [leave open]
Keywords: leave-open
Whiteboard: [leave open]
Assignee

Comment 23

2 years ago
Thanks for jumping on this! I've made the extra uplift request in bug 1406727.
Flags: needinfo?(matt.woodrow)
Assignee

Updated

2 years ago
Priority: -- → P1

Comment 24

Last year
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4bb6a59e797f
Check the canary during allocations. r=jet
Assignee

Updated

Last year
Depends on: 1430962
Per email thread, this is not going to block the 58 release.
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

this isn't going to be on 58 after all.
Attachment #8942598 - Flags: approval-mozilla-release+
Attachment #8942598 - Flags: approval-mozilla-beta+
Moving to P3 because there has been no activity for at least 24 weeks.
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P1 → P3