Bug 1333035 (Open), opened 3 years ago, updated 2 years ago

Crash in OOM | large | NS_ABORT_OOM | mozilla::cyclecollector::HoldJSObjectsImpl

Categories

(Core :: XPCOM, defect, P2, critical)

Version: 51 Branch

Tracking


Tracking Status
firefox-esr45 --- unaffected
firefox50 --- unaffected
firefox51 --- wontfix
firefox52 --- fix-optional
firefox53 --- wontfix
firefox54 + wontfix
firefox55 - wontfix
firefox56 - ---
firefox59 --- wontfix
firefox60 --- affected
firefox61 --- ?

People

(Reporter: philipp, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is report bp-d76bd972-7ded-4fc6-a982-70e592170123.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp:606
1 	xul.dll 	mozilla::cyclecollector::HoldJSObjectsImpl(nsISupports*) 	xpcom/base/HoldDropJSObjects.cpp:32
2 	xul.dll 	mozilla::dom::MessageEvent::InitMessageEvent(JSContext*, nsAString_internal const&, bool, bool, JS::Handle<JS::Value>, nsAString_internal const&, nsAString_internal const&, mozilla::dom::Nullable<mozilla::dom::WindowProxyOrMessagePort> const&, mozilla::dom::Nullable<mozilla::dom::Sequence<mozilla::OwningNonNull<mozilla::dom::MessagePort> > > const&) 	dom/events/MessageEvent.cpp:165
3 	xul.dll 	mozilla::dom::PostMessageEvent::Run() 	dom/base/PostMessageEvent.cpp:153
4 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067
5 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:96
6 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:225
7 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
8 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:262
9 	xul.dll 	nsAppStartup::Run() 	toolkit/components/startup/nsAppStartup.cpp:283
10 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:4401
11 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp:4534
12 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:4625

This crash signature is primarily present on Windows 32-bit versions of the browser (there are also a few isolated reports from Android). Starting with 51 builds, the signature began to grow in volume on pre-release channels; on 51.0b it accounted for 0.13% of all crashes.
Correlations:

(36.78% in signature vs 01.38% overall) Addon "Flash Video Downloader - YouTube HD Download [4K]" = true [48.44% vs 01.61% if startup_crash = 0]
(75.86% in signature vs 28.25% overall) Module "qasf.dll" = true [68.89% vs 30.22% if platform_version = 6.1.7601 Service Pack 1]
(79.31% in signature vs 33.32% overall) Module "MP3DMOD.DLL" = true [73.33% vs 35.92% if platform_version = 6.1.7601 Service Pack 1]
(79.31% in signature vs 34.63% overall) Module "msdmo.dll" = true [73.33% vs 36.98% if platform_version = 6.1.7601 Service Pack 1]
The crash point is here:
https://dxr.mozilla.org/mozilla-release/rev/9a5e1a8c8ebb4fe61386fadb6c52464430e58c54/xpcom/glue/nsBaseHashtable.h#126
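For context, a simplified sketch of what the infallible Put at that line roughly does (reconstructed from memory, not a verbatim copy of the linked source): it retries the insertion fallibly and, if the table cannot grow, aborts with the table's entry storage as the requested allocation size.

// Simplified sketch of nsBaseHashtable's infallible Put around this
// revision; the exact Gecko code may differ.
void Put(KeyType aKey, const UserDataType& aData)
{
  if (!Put(aKey, aData, mozilla::fallible)) {
    // This product is what is reported as "OOM Allocation Size" in the
    // crash report, so the entry count can be recovered by dividing by
    // the entry size.
    NS_ABORT_OOM(mTable.EntrySize() * mTable.EntryCount());
  }
}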

The entry type is a pointer. One of the reports, bp-41453287-d13a-4461-a9a9-a5cd32170120, has an OOM Allocation Size of 194,988,672 bytes (185.96 MB); at 4 bytes per entry that works out to about 48,747,168 entries. The report also contains a memory report; here is an interesting part:

2,711.84 MB (100.0%) -- explicit
├──1,502.06 MB (55.39%) -- window-objects
│  ├────254.72 MB (09.39%) ++ top(<anonymized-91>, id=91)
│  ├────254.26 MB (09.38%) ++ top(<anonymized-114>, id=114)
│  ├────245.66 MB (09.06%) ++ top(<anonymized-134>, id=134)
│  ├────242.53 MB (08.94%) ++ top(<anonymized-242>, id=242)
│  ├────240.44 MB (08.87%) ++ top(<anonymized-153>, id=153)
│  ├────240.29 MB (08.86%) ++ top(<anonymized-264>, id=264)
│  └─────24.16 MB (00.89%) ++ (7 tiny)
├────867.73 MB (32.00%) ── heap-unclassified
├────193.83 MB (07.15%) -- xpconnect
│    ├──192.12 MB (07.08%) ── runtime
│    └────1.72 MB (00.06%) ++ (4 tiny)
├─────72.50 MB (02.67%) -- js-non-window
│     ├──40.49 MB (01.49%) ++ (2 tiny)
│     └──32.01 MB (01.18%) ++ runtime
├─────44.27 MB (01.63%) ++ heap-overhead
└─────31.46 MB (01.16%) ++ (19 tiny)

Note the 867.73 MB of heap-unclassified; we may want to add a memory reporter to CycleCollectedJSContext.
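In case it helps, a rough sketch of what such a reporter could look like; the class name, the report path, and the mJSHolders accounting below are assumptions for illustration, not existing Gecko code, and the wiring into CycleCollectedJSContext is omitted.

#include "nsIMemoryReporter.h"

// Hypothetical malloc-size-of function for the reporter below.
MOZ_DEFINE_MALLOC_SIZE_OF(CycleCollectedJSContextMallocSizeOf)

// Sketch of a reporter accounting for the JS holders hash table, assuming
// CycleCollectedJSContext exposed a way to measure mJSHolders (e.g. via
// ShallowSizeOfExcludingThis).
class CycleCollectedJSContextReporter final : public nsIMemoryReporter
{
public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD CollectReports(nsIHandleReportCallback* aHandleReport,
                            nsISupports* aData, bool aAnonymize) override
  {
    // Placeholder; would be something like
    // context->mJSHolders.ShallowSizeOfExcludingThis(
    //     CycleCollectedJSContextMallocSizeOf).
    size_t jsHoldersSize = 0;

    MOZ_COLLECT_REPORT(
      "explicit/cycle-collected-js-context/js-holders", KIND_HEAP, UNITS_BYTES,
      jsHoldersSize,
      "Memory used by the cycle collector's table of JS holders.");
    return NS_OK;
  }

private:
  ~CycleCollectedJSContextReporter() = default;
};

NS_IMPL_ISUPPORTS(CycleCollectedJSContextReporter, nsIMemoryReporter)

It would presumably be registered with RegisterStrongMemoryReporter() during context initialization; either way, it would move this allocation out of heap-unclassified.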

It also has a very high number of event listeners:

13,527,644 (100.0%) -- event-counts
└──13,527,644 (100.0%) -- window-objects
   ├───2,331,180 (17.23%) -- top(<anonymized-91>, id=91)/active
   │   ├──2,331,079 (17.23%) -- window(<anonymized-105>)/dom
   │   │  ├──2,330,175 (17.23%) ── event-listeners
   │   │  └────────904 (00.01%) ── event-targets
   │   └────────101 (00.00%) ++ (8 tiny)
   ├───2,331,039 (17.23%) -- top(<anonymized-114>, id=114)/active
   │   ├──2,330,938 (17.23%) -- window(<anonymized-125>)/dom
   │   │  ├──2,330,146 (17.23%) ── event-listeners
   │   │  └────────792 (00.01%) ── event-targets
   │   └────────101 (00.00%) ++ (8 tiny)
   ├───2,253,908 (16.66%) -- top(<anonymized-134>, id=134)/active
   │   ├──2,253,806 (16.66%) -- window(<anonymized-144>)/dom
   │   │  ├──2,253,042 (16.66%) ── event-listeners
   │   │  └────────764 (00.01%) ── event-targets
   │   └────────102 (00.00%) ++ (8 tiny)
   ├───2,215,805 (16.38%) -- top(<anonymized-242>, id=242)/active
   │   ├──2,215,704 (16.38%) -- window(<anonymized-253>)/dom
   │   │  ├──2,214,996 (16.37%) ── event-listeners
   │   │  └────────708 (00.01%) ── event-targets
   │   └────────101 (00.00%) ++ (8 tiny)
   ├───2,203,160 (16.29%) -- top(<anonymized-153>, id=153)/active
   │   ├──2,203,060 (16.29%) -- window(<anonymized-164>)/dom
   │   │  ├──2,202,382 (16.28%) ── event-listeners
   │   │  └────────678 (00.01%) ── event-targets
   │   └────────100 (00.00%) ++ (7 tiny)
   ├───2,190,749 (16.19%) -- top(<anonymized-264>, id=264)/active
   │   ├──2,190,648 (16.19%) -- window(<anonymized-275>)/dom
   │   │  ├──2,189,828 (16.19%) ── event-listeners
   │   │  └────────820 (00.01%) ── event-targets
   │   └────────101 (00.00%) ++ (8 tiny)
   └───────1,803 (00.01%) ++ (7 tiny)
Any idea why this table can grow so large?
Flags: needinfo?(continuation)
Component: General → XPCOM
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #3)
> Any idea why this table can grow so large?

Every cycle collected object that has a pointer to a JS object must be added to this table. Maybe some new class was cycle collected in 51, and this addon is creating a lot of them? Alternatively, maybe we are somehow failing to remove entries from the table.
Flags: needinfo?(continuation)
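For reference, a minimal sketch of the pattern that populates this table; ExampleHolder and mValue are made-up names, and the cycle collection trace/unlink/ISUPPORTS implementations that normally live in the .cpp file are omitted.

#include "mozilla/HoldDropJSObjects.h"
#include "nsCycleCollectionParticipant.h"
#include "js/RootingAPI.h"
#include "js/Value.h"

// Illustrative cycle-collected class that keeps a JS value alive.
class ExampleHolder final : public nsISupports
{
public:
  NS_DECL_CYCLE_COLLECTING_ISUPPORTS
  NS_DECL_CYCLE_COLLECTION_SCRIPT_HOLDER_CLASS(ExampleHolder)

  void SetValue(JS::Handle<JS::Value> aValue)
  {
    mValue = aValue;
    // Adds |this| to the cycle collector's JS holders hash table (the
    // table whose growth aborts in this crash signature).
    mozilla::HoldJSObjects(this);
  }

private:
  ~ExampleHolder()
  {
    // Removes the entry again; a holder that is never destroyed, or a class
    // that forgets to call this, leaves its entry behind.
    mozilla::DropJSObjects(this);
  }

  JS::Heap<JS::Value> mValue;
};

Every live holder costs one entry between HoldJSObjects and DropJSObjects, so either a very large number of live holders (e.g. millions of pending worker timeouts or event listeners) or missed DropJSObjects calls would make the table grow like this.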
Depends on: 1333499
Depends on: 1333502
I see 150 crash reports. Of these, 97 are called from WorkerGlobalScopeBinding::setTimeout, and 20 from MessageEvent::InitMessageEvent. Could a worker be creating a lot of timeouts that we never run, so they build up?
Oh, hey, we have a bug somewhere where GC/CC may not run often enough in workers.
IIRC, we cancel the timer that would run them when something else starts to run, or something like that.
Thanks, Andrew and Olli. Let's see if bug 1216175 can also fix this.
See Also: → 1216175
Too late for a fix for 53, as we are in the last week of the 53 beta cycle.
[Tracking Requested - why for this release]:

I don't think regression is quite correct here, so removing that.

That being said, it would be nice to get this fixed via bug 1216175 if possible. People are only going to use workers more and more, and the number of crashes in the last week is similar to other bugs we're tracking, so...
Keywords: regression
Priority: -- → P2
Marking 54 fix-optional, as the volume of crashes for 54 is low at the moment.
There isn't a lot of point in tracking stuff like this if it's low volume on release, we can't get someone to take on the bug, and it isn't a security issue. Let's drop this and count on progress from bug 1216175.
The user in bug 1451250 seems to be able to reproduce the problem with a particular webapp; could they run some troubleshooting steps to help us further debug and understand the issue?
mccr8 is the CC expert.
Flags: needinfo?(nfroyd) → needinfo?(continuation)
The OOM allocation size was 3GB, so I don't think we really expect this table to work in that circumstance. I passed along the needinfo to baku in that bug, who might have some ideas about why we are accumulating so many timeouts on workers.
Flags: needinfo?(continuation)