This bug was filed from the Socorro interface and is report bp-bc93d3d2-ea43-41d7-8bcf-6c7ad0171128.
=============================================================
Top 10 frames of crashing thread:
0 xul.dll NS_ABORT_OOM xpcom/base/nsDebugImpl.cpp:620
1 xul.dll nsPresArena::Allocate layout/base/nsPresArena.cpp:148
2 xul.dll nsDisplayListBuilder::CreateClipChainIntersection layout/painting/nsDisplayList.cpp:1614
3 xul.dll nsDisplayListBuilder::CopyWholeChain layout/painting/nsDisplayList.cpp:1622
4 xul.dll nsDisplayListBuilder::MarkOutOfFlowFrameForDisplay layout/painting/nsDisplayList.cpp:1200
5 xul.dll nsDisplayListBuilder::MarkFramesForDisplayList layout/painting/nsDisplayList.cpp:1477
6 xul.dll nsIFrame::MarkAbsoluteFramesForDisplayList layout/generic/nsFrame.cpp:3767
7 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:3664
8 xul.dll nsFlexContainerFrame::BuildDisplayList layout/generic/nsFlexContainerFrame.cpp:2267
9 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:3717
=============================================================

Out-of-memory crashes on Windows with BuildDisplayListForChild in their proto signature are rising in the 58 cycle. On 58.0b there are around 150 more daily reports of this than before:
https://crash-stats.mozilla.com/signature/?product=Firefox&submitted_from_infobar=%21__true__&proto_signature=~BuildDisplayListForChild&release_channel=beta&signature=OOM%20%7C%20small&date=%3E%3D2017-08-01#graphs

During the 58 Nightly cycle these crashes started spiking around 2017-10-30:
https://crash-stats.mozilla.com/signature/?product=Firefox&submitted_from_infobar=%21__true__&proto_signature=~BuildDisplayListForChild&release_channel=nightly&signature=OOM%20%7C%20small&date=%3E%3D2017-08-01#graphs
Jet, can you find someone to take a look?
(wondering if we started experiments with RDL on beta yet...)
Bug 1411881 landed around the time this spiked on Nightly. Maybe related? Otherwise, below is a very rough pushlog range for around the time this regressed. A few other display list changes in there too. https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2017-10-29&enddate=2017-10-31
(In reply to Mike Taylor [:miketaylr] (58 Regression Engineering Owner) from comment #2)
> (wondering if we started experiments with RDL on beta yet...)

layout.display-list.retain is still false in 59.0b8. Matt, is there any correlation with refactoring in nearby code that may not be guarded by the pref?
Matt, any ideas?
(oops, accidentally cleared ni?)
I'm working on trying to narrow this down. The crash reports themselves don't show a lot; most look like normal OOMs, though some seem to be OOM crashes with fairly low memory usage. It's possible that there are multiple issues causing this. Do we know if any other OOM signatures dropped around this time?
Assignee: nobody → matt.woodrow
On the beta channel the generic [@ OOM|small] signature seems to be rising in 58 as well:
https://crash-stats.mozilla.com/signature/?submitted_from_infobar=%21__true__&release_channel=beta&product=Firefox&signature=OOM%20%7C%20small&date=%3E%3D2017-06-08T12%3A14%3A13.000Z&date=%3C2017-12-08T11%3A14%3A13.000Z#graphs
(When looking into that I stumbled upon the crashes described in this bug report.)

A signature related to memory pressure that has dropped in 58 is [@ EMPTY: no crashing thread identified; ERROR_NO_MINIDUMP_HEADER] - but since those reports don't contain much information, it's hard to attribute the decline to anything in particular.
OK, that does sound like a real memory usage increase.

Looking at the crash stats, this happened no more than once per Nightly until the 1022 and 1023 builds, where we had two crashes, then 4 in the 1027 build, 6 and 11 in the two 1029 builds, and 14 in the 1030 build (taken from build ID aggregations). It's really hard to know exactly when it started because of the amount of variance: the 4 crashes in 1027 suggest that build likely had the bug, but we went back to 1 crash in 1028.

Do we have any way of measuring uptake/usage of each Nightly build? I assume Nightlies from certain days of the week get more usage than others, but I don't know how to quantify that and apply it to these results.

Wider regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2017-10-21&enddate=2017-10-30

If we assume the 1022 and 1023 results are significant, then the regression must have happened on 1021/1022, and there's nothing interesting (relating to display lists) there. The main retained-DL code landed on the 23rd. Bug 1405146 landed on the 25th; that one seems like it could increase the total memory used during painting (probably not a huge amount, but it depends on the page). That doesn't fit the timing perfectly, but it's possible.

I've set up ASan and DMD builds on my local Windows machine, but haven't been able to reproduce any leaking or corruption.
The daily crash reports are still quite numerous (over 100/day). Did we see any memory regression within the 1021/1022/1023 builds?
Looking at https://crash-stats.mozilla.com/search/?proto_signature=~nsIFrame%3A%3ABuildDisplayListForChild&product=Firefox&date=%3E%3D2018-01-03T04%3A17%3A25.000Z&date=%3C2018-01-10T04%3A17%3A25.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature, the volume is super high. What's the status of this 58 blocker? Thanks.
I'm struggling to make progress with this. A lot of the reports really don't seem to be in a particularly low memory state at all, like this one:
https://crash-stats.mozilla.com/report/index/3d6da33d-850b-49a0-8d15-afbda0180105

8.79TB of virtual memory left, 2.32GB of physical memory left, and 6.46GB of the page file left. Crashing with an OOM with those numbers seems really suspicious to me. Still trying to figure out more.
Nathan, do you have any ideas why we'd fail to allocate memory when there's still so much left? The only alternative I can think of is that ArenaChunk::header::offset is 0 (which has been seen before, in bug 1406727 comment 36) which makes ArenaChunk::Allocate return nullptr despite not really being OOM. That seems like it should be much too rare to cause this volume of crashes though.
Flags: needinfo?(matt.woodrow) → needinfo?(nfroyd)
Skimming through, some of them have a very small amount of contiguous free memory left, e.g.:
https://crash-stats.mozilla.com/report/index/401dc51d-8b97-4ed7-bf34-473570180109
https://crash-stats.mozilla.com/report/index/98afc26e-5a76-4ec2-89ea-0bbe60180109
https://crash-stats.mozilla.com/report/index/1ba9339b-80e3-407a-9f61-42d8b0180109
all have a largest contiguous VM block of < 2MB. If the allocator winds up requesting blocks of memory from the OS, and requests 2MB chunks when it does so (to carve larger blocks out of), you're going to be out of luck. That's just life.

But that doesn't explain the crash in comment 13, or one like:
https://crash-stats.mozilla.com/report/index/0020ff4d-f4af-4988-97f7-667cd0180109
which both have tons of space--total virtual/physical and large chunks of VM--unless the largest contiguous VM block measurements (see the "largest_free_vm_block" field in the Raw Dump tab) are completely out of whack. But then that'd be some massive fragmentation, given that there's so much space left.

There are also things like:
https://crash-stats.mozilla.com/report/index/1c1fd1a7-203a-49cf-9f1a-a7e750180109
which has ~4MB of contiguous VM space left, but still OOMs.

ArenaChunk::header::offset being busted seems reasonable to me, but then I don't know what made it that way. More canaries are in order? =/
Worth a shot at least!
Attachment #8942598 - Flags: review?(nfroyd)
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

r+ to get this onto Beta and get crash reports back ASAP. Nathan: I'll leave an NI on you to have a look at the first reports that come back. Thanks!
Attachment #8942598 - Flags: review?(nfroyd) → review+
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

Approval Request Comment
[Feature/Bug causing the regression]: See bug 1421345.
[User impact if declined]: Undiagnosed OOM crashes.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: This is not a fix. It's diagnostic code to help identify a root cause for OOM crashes when there's still available memory.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: No.
[Is the change risky?]: Low risk
[Why is the change risky/not risky?]: Diagnostic code that we'll pull out before we ship to Release.
[String changes made/needed]: None.
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

For debugging purposes. Beta58+.
Backed out for bustage at dist/include/mozilla/ArenaAllocator.h:180:7: 'canary' was not declared in this scope:
https://hg.mozilla.org/releases/mozilla-beta/rev/9579dad4492b9ce9e2be0379bae320e1f6327394
https://hg.mozilla.org/releases/mozilla-release/rev/fae7c41d40fd8ddb4d6d0ade34af7c75fef0e4d5

Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=814254bd1eb76533621eea0700d0182aa3121350&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Build log: https://treeherder.mozilla.org/logviewer.html#?job_id=156352910&repo=mozilla-release

[task 2018-01-15T11:49:56.529Z] 11:49:56 INFO - gmake: Entering directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/base'
[task 2018-01-15T11:49:56.530Z] 11:49:56 INFO - /usr/bin/ccache /builds/worker/workspace/build/src/gcc/bin/g++ -std=gnu++11 -o Unified_cpp_xpcom_base0.o -c -I/builds/worker/workspace/build/src/obj-firefox/dist/stl_wrappers -I/builds/worker/workspace/build/src/obj-firefox/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -DNDEBUG=1 -DTRIMMED=1 -DOS_POSIX=1 -DOS_LINUX=1 -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/builds/worker/workspace/build/src/xpcom/base -I/builds/worker/workspace/build/src/obj-firefox/xpcom/base -I/builds/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/builds/worker/workspace/build/src/ipc/chromium/src -I/builds/worker/workspace/build/src/ipc/glue -I/builds/worker/workspace/build/src/xpcom/build -I/builds/worker/workspace/build/src/dom/base -I/builds/worker/workspace/build/src/xpcom/ds -I/builds/worker/workspace/build/src/obj-firefox/dist/include -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-firefox/mozilla-config.h -Wall -Wc++11-compat -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wc++14-compat -Wduplicated-cond -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wformat -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -D_GLIBCXX_USE_CXX11_ABI=0 -pipe -g -O3 -fomit-frame-pointer -Werror -I/builds/worker/workspace/build/src/widget/gtk/compat-gtk3 -pthread -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gtk-3.0/unix-print -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gtk-3.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gio-unix-2.0/ -I/builds/worker/workspace/build/src/gtk3/usr/local/include/cairo -I/builds/worker/workspace/build/src/gtk3/usr/local/include/pango-1.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/atk-1.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/cairo -I/builds/worker/workspace/build/src/gtk3/usr/local/include/pixman-1 -I/builds/worker/workspace/build/src/gtk3/usr/local/include -I/builds/worker/workspace/build/src/gtk3/usr/local/include/gdk-pixbuf-2.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/include/glib-2.0 -I/builds/worker/workspace/build/src/gtk3/usr/local/lib/glib-2.0/include -I/builds/worker/workspace/build/src/gtk3/usr/include/freetype2 -I/builds/worker/workspace/build/src/gtk3/usr/include/libpng12 -fprofile-generate -MD -MP -MF .deps/Unified_cpp_xpcom_base0.o.pp /builds/worker/workspace/build/src/obj-firefox/xpcom/base/Unified_cpp_xpcom_base0.cpp
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - In file included from /builds/worker/workspace/build/src/obj-firefox/dist/include/nsPresArena.h:13:0,
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - from /builds/worker/workspace/build/src/obj-firefox/dist/include/nsIPresShell.h:38,
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - from /builds/worker/workspace/build/src/obj-firefox/dist/include/nsPresContext.h:19,
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - from /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/dom/Element.h:28,
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - from /builds/worker/workspace/build/src/dom/base/nsDOMMutationObserver.h:20,
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - from /builds/worker/workspace/build/src/xpcom/base/CycleCollectedJSContext.cpp:35,
[task 2018-01-15T11:49:56.531Z] 11:49:56 INFO - from /builds/worker/workspace/build/src/obj-firefox/xpcom/base/Unified_cpp_xpcom_base0.cpp:20:
[task 2018-01-15T11:49:56.532Z] 11:49:56 INFO - /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/ArenaAllocator.h: In member function 'void* mozilla::ArenaAllocator<ArenaSize, Alignment>::ArenaChunk::Allocate(size_t)':
[task 2018-01-15T11:49:56.532Z] 11:49:56 INFO - /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/ArenaAllocator.h:180:7: error: 'canary' was not declared in this scope
[task 2018-01-15T11:49:56.532Z] 11:49:56 INFO - canary.Check();
[task 2018-01-15T11:49:56.532Z] 11:49:56 INFO - ^~~~~~
[task 2018-01-15T11:49:56.534Z] 11:49:56 INFO - /builds/worker/workspace/build/src/config/rules.mk:1028: recipe for target 'Unified_cpp_xpcom_base0.o' failed
[task 2018-01-15T11:49:56.534Z] 11:49:56 INFO - gmake: *** [Unified_cpp_xpcom_base0.o] Error 1
[task 2018-01-15T11:49:56.535Z] 11:49:56 INFO - gmake: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/xpcom/base'
[task 2018-01-15T11:49:56.535Z] 11:49:56 INFO - /builds/worker/workspace/build/src/config/recurse.mk:73: recipe for target 'xpcom/base/target' failed
[task 2018-01-15T11:49:56.535Z] 11:49:56 INFO - gmake: *** [xpcom/base/target] Error 2
Looks like we'll also need to uplift at least the patch for bug 1406727 comment 46. I'll let Matt request the required uplifts for that one.
Thanks for jumping on this! I've made the extra uplift request in bug 1406727.
Pushed by email@example.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/4bb6a59e797f Check the canary during allocations. r=jet
Backed out because dependency bug 1406727 (comment 23) hasn't been uplifted yet:
https://hg.mozilla.org/releases/mozilla-release/rev/9373b4bc3dd7a187baf7da6f26cd8050e7c2e231
https://hg.mozilla.org/releases/mozilla-beta/rev/ae8dbca85bfec5b3e53e578b5c65b0effbeccdc2
Per email thread, this is not going to block the 58 release.
Comment on attachment 8942598 [details] [diff] [review]
Check the canary during allocations

This isn't going to be in 58 after all.
I'm seeing canary crashes in 59 from the last week:
https://crash-stats.mozilla.com/report/index/735e9267-d1f6-4caa-b468-e43890180120
https://crash-stats.mozilla.com/report/index/2249364f-c1f8-4b79-b762-ff90f0180119
https://crash-stats.mozilla.com/report/index/3ce7dd54-644d-4e03-999e-31cdc0180119

Matt, do any of these shed light on the situation?
Flags: needinfo?(nfroyd) → needinfo?(matt.woodrow)
Moving to P3 because there has been no activity for at least 24 weeks. See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information.
Priority: P1 → P3