Bug 1488480 (Open) - Opened 1 year ago - Updated 1 year ago

Crash in nsGlobalWindowInner::InnerSetNewDocument

Categories

(Core :: DOM: Core & HTML, defect, P2, critical)

63 Branch
defect

Tracking


Tracking Status
firefox-esr60 --- fix-optional
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- fix-optional
firefox68 --- fix-optional

People

(Reporter: philipp, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

This bug was filed from the Socorro interface and is
report bp-7ce99a53-04f2-4fbd-bf54-225ed0180902.
=============================================================

Top 10 frames of crashing thread:

0 xul.dll nsGlobalWindowInner::InnerSetNewDocument dom/base/nsGlobalWindowInner.cpp:1690
1 xul.dll nsGlobalWindowOuter::SetNewDocument dom/base/nsGlobalWindowOuter.cpp:1941
2 xul.dll nsresult nsDocumentViewer::InitInternal layout/base/nsDocumentViewer.cpp:1033
3 xul.dll nsDocumentViewer::Init layout/base/nsDocumentViewer.cpp:772
4 xul.dll nsresult nsDocShell::Embed docshell/base/nsDocShell.cpp:6674
5 xul.dll nsresult nsDocShell::CreateAboutBlankContentViewer docshell/base/nsDocShell.cpp:7553
6 xul.dll nsresult nsDocShell::EnsureContentViewer docshell/base/nsDocShell.cpp:7404
7 xul.dll nsDocShell::GetDocument docshell/base/nsDocShell.cpp:3928
8 xul.dll nsGlobalWindowOuter::WrapObject dom/base/nsGlobalWindowOuter.h:236
9 xul.dll XPCConvert::NativeInterface2JSObject js/xpconnect/src/XPCConvert.cpp:988

=============================================================

Crashes with this signature have been around for a while but are increasing in volume during the 63.0a1 nightly cycle. They are mostly crashes in the content process with MOZ crash reason "MOZ_CRASH(Unhandlable OOM while clearing document dependent slots.)".

The first nightly where the increased crash volume was apparent was 63.0a1 build 20180711100118. The pushlog of changes relative to the previous build is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=70f901&tochange=aff060
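For context, the crash reason corresponds to a deliberate crash when a fallible JSAPI call fails while the inner window's cached `document` slot is being cleared during document setup. A minimal sketch of that pattern, based on my reading of nsGlobalWindowInner (the generated binding helper name is an assumption, and ClearDocumentDependentSlots is inlined into InnerSetNewDocument in the crash signature):

  // Sketch only, not a verified copy of the tree at this revision.
  void nsGlobalWindowInner::ClearDocumentDependentSlots(JSContext* aCx) {
    // Clearing the cached slot goes through JSAPI and can fail (typically on
    // OOM). There is no safe way to recover halfway through swapping
    // documents, so the failure is turned into a deliberate crash.
    if (!Window_Binding::ClearCachedDocumentValue(aCx, this)) {
      MOZ_CRASH("Unhandlable OOM while clearing document dependent slots.");
    }
  }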
Component: General → DOM
Maybe this is a signature change for bug 1405521?

IIRC there was some other OOM signature that increased in 63.
See Also: → 1405521
Priority: -- → P2
I don't see an assignee on this or 1405521. Are they actively being worked on?
Flags: needinfo?(continuation)
I don't think so. This seems like an OOM so I'm not sure what we can do. It is odd that it spiked up recently. I didn't see any corresponding decrease in the other bug.

Any ideas, Boris? You had a long comment in the other bug about this code. Thanks.
Flags: needinfo?(continuation) → needinfo?(bzbarsky)
This does basically seem like a signature change for bug 1405521.  Is that one dropping off as this one increases?

The stack linked in comment 0 is actually pretty complete.  It's showing us running some JS that adds a node to the DOM (presumably some subtree containing an iframe), which kicks off nsFrameLoader::ReallyStartLoadingInternal, which runs some chrome JS, which tries to touch the window in that docshell, so we CreateAboutBlankContentViewer and as part of that setup hit this assertion.

Depending on what the JS that adds a node was doing and how deep its callstack was, we _could_ be hitting recursion limits here, I suppose...

I wonder whether it would make sense to add more informative MOZ_CRASH bits inside JSCompartment::wrap to see whether one of the things in there is failing and if so which one.  And the same for the other bits bug 1405521 comment 1 mentions.
Flags: needinfo?(bzbarsky)
Mine: bp-762b3628-c652-4375-ba18-e47320180915, but that's with a one-month-old nightly.
Right, I'm going to see whether I can get more info from the instrumentation I added.
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #6)
> Right, I'm going to see whether I can get more info from the instrumentation
> I added.

Boris, did your instrumentation provide information that could help fix this crash? Thanks
Flags: needinfo?(bzbarsky)
We have some information, but still trying to make sense of it.  Things that would be really helpful:

1)  An answer to my question from the beginning of comment 4.

2)  Figuring out whether the crashes under Window_Binding::get_document in builds with bug 1491313 in them happen at the same frequency as the crashes with this bug number in the crash reason in builds with bug 1493849 in them.

For both of these questions, my main issue is not knowing how to extract crash frequencies from crash-stats.

See bug 1493849 comment 8 for our current understanding of the situation here.
Flags: needinfo?(bzbarsky)
Regarding the question at the beginning of comment 4: the nsGlobalWindow::ClearDocumentDependentSlots signature is gone as of 58.0.2. The nsGlobalWindowInner::ClearDocumentDependentSlots signature has a few crashes in 63, but is barely noticeable. That signature continues in the 62.0.2 release, but there are only 500 crashes in the last 7 days. So I think the theory that this signature morphed in some way from bug 1405521 is a valid one.

I will try to see if I can obtain an answer to #2. One thing I noticed about the crashes in mozilla::dom::Window_Binding::get_document after the instrumentation landed is that the two URLs that show up were both on what appeared to be "Powered by Atlassian" sites.
Many of us have been using https://addons.mozilla.org/en-US/firefox/addon/bugzilla-crash-stop/ to see crash frequency - it probably doesn't solve all of what we are trying to get out of Comment 8, but nevertheless is helpful in seeing daily crashes per signature.
> This does basically seem like a signature change for bug 1405521.  Is that one dropping off as this one increases?
Yes, that may be the case: [@ nsGlobalWindowInner::ClearDocumentDependentSlots] dropped off as a somewhat common crash (1-2 crashes per build) after nightly build 20180709221247, though there were still single occurrences afterwards, and [@ nsGlobalWindowInner::InnerSetNewDocument] started taking off two days later in build 20180711100118 (with ~2-5 crashes per nightly build).

> Figuring out whether the crashes under Window_Binding::get_document in builds with bug 1491313 in them happen at the same frequency as the crashes with this bug number in the crash reason in builds with bug 1493849 in them.

I hope I understand the request correctly.
* There are only 4 reports from builds with bug 1491313 in them & Window_Binding::get_document:
https://crash-stats.mozilla.com/search/?signature=~Window_Binding%3A%3Aget_document&build_id=%3E%3D20180919003800&release_channel=nightly&date=%3E%3D2018-08-01#crash-reports
* There are 9 reports from builds with bug 1493849 in them and a MOZ crash reason containing "1488480":
https://crash-stats.mozilla.com/search/?moz_crash_reason=~1488480&build_id=%3E%3D20180925220052&release_channel=nightly&date=%3E%3D2018-08-01#crash-reports
> nsGlobalWindow::ClearDocumentDependentSlots signature is gone as of 58.0.2

Right, because the nsGlobalWindow class got split up into separate nsGlobalWindowInner/Outer classes. nsGlobalWindowInner::ClearDocumentDependentSlots is the new name for the function.

> Many of us have been using https://addons.mozilla.org/en-US/firefox/addon/bugzilla-crash-stop/ 

Thank you.  Unfortunately, I am trying to break down this data by which nightly the crash is in, which doesn't  seem like something the add-on really handles.

> * There are only 4 reports from builds with bug 1491313 in them & Window_Binding::get_document:

There are actually 16, but crash-stats has kinda messed up signatures for the other 12.  See <https://crash-stats.mozilla.com/search/?signature=~get_document&build_id=%3E%3D20180919003800&release_channel=nightly&date=%3E%3D2018-08-01#crash-reports>.

But really, counting number of reports without knowing how many users of that nightly exist is a weird exercise.  What we want is crashes per hour of use or something.

That said, the numbers here are small enough I'm not sure we can get any useful statistical signal out anyway.  :(

> * There are 9 reports from builds with bug 1493849 in them and a MOZ crash reason containing "1488480":

Right, over the course of almost a week.

I should probably put the "Looks like bug 1488480/1405521, with WrapObject() on nsIDocument throwing" MOZ_CRASH back and see whether it ever happens or whether we always hit one of the other crashes first.   It should be the latter.

I will work on that and some other instrumentation, I guess.
OK.  So at this point, on trunk, all the crashes under ClearDocumentDependentSlots seem to be caused by this MOZ_CRASH:

  MOZ_CRASH(Looks like bug 1488480/1405521, with aCreator failing to create the per-interface object)

in HTMLDocument_Binding::CreateInterfaceObjects.  This comes from GetPerInterfaceObjectHandle being called with aSlotId either prototypes::id::HTMLDocument or prototypes::id::Document.  

Since we're getting here via HTMLDocument_Binding::CreateInterfaceObjects in all the cases, this is specifically the Document_Binding::GetProtoObjectHandle call failing, which passes Document_Binding::CreateInterfaceObjects for aCreator...

Now why that fails, I'm not sure yet.  I guess I will try adding more instrumentation to CreateInterfaceObjects methods and seeing what gets hit.

Looking at the actual crash reports, some do look like OOM (e.g.  https://crash-stats.mozilla.com/report/index/546181a3-2385-40e0-8795-3cef30181011 has 5MB free memory).   But something like https://crash-stats.mozilla.com/report/index/f0dfc055-0b3b-411c-adf6-5e26b0181014 certainly doesn't look like OOM to me...
At this point, crash-stats points to JS_DefineProperties on Document.prototype as the thing that consistently fails when we hit this situation.

https://crash-stats.mozilla.com/signature/?product=Firefox&version=65.0a1&proto_signature=ClearDocumentDependentSlots&signature=mozilla%3A%3Adom%3A%3ADefinePrefable%3CT%3E&date=%3E%3D2018-11-12T04%3A45%3A02.000Z&date=%3C2018-11-19T04%3A45%3A02.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports is a list of reports.  I looked through them, and they're not obviously short on memory, so if it's OOM it's not due to the OS running out of RAM...
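To summarize the failure path pieced together in the last two comments (an illustrative outline based on my reading of the comments, not the actual generated binding code):

  HTMLDocument_Binding::CreateInterfaceObjects
    -> Document_Binding::GetProtoObjectHandle              // parent prototype
      -> GetPerInterfaceObjectHandle(prototypes::id::Document,
                                     aCreator = Document_Binding::CreateInterfaceObjects)
        -> Document_Binding::CreateInterfaceObjects        // the aCreator that fails
          -> DefinePrefable<JSPropertySpec>
            -> JS_DefineProperties on Document.prototype   // reports OOM, returns false

Every step here is fallible and just propagates false upward, so the failure only surfaces later as the MOZ_CRASH in ClearDocumentDependentSlots / InnerSetNewDocument.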
I'd really appreciate it if someone on the JS team could look into why JS_DefineProperties sometimes fails here, even though there is plenty of available RAM.
Flags: needinfo?(sdetar)
Flags: needinfo?(jdemooij)
All the crashes in Comment 14 have JSOutOfMemory:Reported. Looking at a crash with low system memory percentage I see that the machine has 32GB of ram and the content process has >7GB of private 1 MB chunks (of which some are GC chunks and others are full jemalloc arenas).

In [1], jonco points out that JSGC_MAX_BYTES is set to UINT32_MAX. Looking at the code in [2], there is a comment about this value meaning infinity, but inside [3] we end up implicitly casting it to size_t, so I'm not sure infinity is actually the case.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1405521#c3
[2] https://searchfox.org/mozilla-central/rev/8f89901f2d69d9783f946a7458a6d7ee70635a94/js/xpconnect/src/XPCJSRuntime.cpp#3058-3064
[3] https://searchfox.org/mozilla-central/rev/8f89901f2d69d9783f946a7458a6d7ee70635a94/js/src/gc/GC.cpp#1441
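The implicit cast Ted mentions can be illustrated with a standalone snippet (plain C++, not the actual Gecko code paths): requesting UINT32_MAX as "no limit" and then storing it in a size_t byte counter gives a hard cap of just under 4 GiB on 64-bit, not infinity.

  #include <cstdint>
  #include <cstdio>

  int main() {
    // The XPCJSRuntime-style intent: UINT32_MAX for JSGC_MAX_BYTES, meaning "no limit".
    uint32_t requestedMaxBytes = UINT32_MAX;

    // Inside the GC the parameter ends up in a size_t byte counter. On 64-bit
    // this is simply 4294967295 bytes, i.e. a real cap rather than infinity.
    size_t maxBytes = requestedMaxBytes;

    printf("effective GC heap cap: %zu bytes (%.2f GiB)\n", maxBytes,
           maxBytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
  }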
Jon, do we effectively have a 4GB GC heap limit? (That does sound pretty reasonable...)

Also, bz, maybe before the forced crash we should directly check whether the failure left a pending exception?
Flags: needinfo?(jcoppeard)
We can probably check whatever we want there... What specifically do you want to check?
I can write a patch to split the MOZ_CRASH into separate ones for:

(1) No pending exception.
(2) Pending exception is an object.
(3) Pending exception is the "out of memory" string.
(4) Everything else.

That should at least help us confirm it's OOM related.
Flags: needinfo?(sdetar)
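A rough sketch of what that split could look like (assumed shape only, not the patch that actually landed for bug 1509269; the helper name is hypothetical):

  #include "jsapi.h"               // JS_IsExceptionPending, JS_GetPendingException
  #include "mozilla/Assertions.h"  // MOZ_CRASH

  // Hypothetical helper: classify the pending exception right before the
  // forced crash so the cases show up as distinct crash reasons in crash-stats.
  static void CrashWithPendingExceptionInfo(JSContext* aCx) {
    if (!JS_IsExceptionPending(aCx)) {
      MOZ_CRASH("Clearing document dependent slots failed: no pending exception.");
    }
    JS::Rooted<JS::Value> exn(aCx);
    if (!JS_GetPendingException(aCx, &exn)) {
      MOZ_CRASH("Clearing document dependent slots failed: could not read pending exception.");
    }
    if (exn.isObject()) {
      MOZ_CRASH("Clearing document dependent slots failed: pending exception is an object.");
    }
    bool isOOM = false;
    if (exn.isString() &&
        JS_StringEqualsAscii(aCx, exn.toString(), "out of memory", &isOOM) &&
        isOOM) {
      MOZ_CRASH("Clearing document dependent slots failed: pending 'out of memory' string.");
    }
    MOZ_CRASH("Clearing document dependent slots failed: some other pending exception.");
  }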
Depends on: 1509269
(In reply to Ted Campbell [:tcampbell] from comment #16)
> All the crashes in Comment 14 have JSOutOfMemory:Reported. Looking at a
> crash with low system memory percentage I see that the machine has 32GB of
> ram and the content process has >7GB of private 1 MB chunks (of which some
> are GC chunks and others are full jemalloc arenas).

Hm, I missed this comment. That probably makes bug 1509269 a bit less useful...
(In reply to Ted Campbell [:tcampbell] from comment #17)
Yes, we do. Although maybe we should relax this, as people seem to be using machines with e.g. 32GB of RAM.
Flags: needinfo?(jcoppeard)
So right, this definitely looks like OOM; the crashes have JSOutOfMemory: Reported in the metadata tab...
Depends on: 1509318
Flags: needinfo?(jdemooij)
OK.  I'm going to back out the various instrumentation I added, since it doesn't look like there's much point to it now.
Seems unlikely to be fixed in 64 at this point. Should we count this as stalled?

(In reply to Jon Coppeard (:jonco) from comment #21)
> (In reply to Ted Campbell [:tcampbell] from comment #17)
> > (4GB limit)
> Yes, we do. Although maybe we should relax this, as people seem to be using machines with e.g. 32GB of RAM.

So is that happening? Should this bug morph into a discussion of that question, or is it happening elsewhere?

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #25)
> So is that happening? Should this bug morph into a discussion of that question, or is it happening elsewhere?

Bug 1509318 is about supporting higher limits in the JS engine. Maybe this bug could be used to discuss whether that's the right solution here.

Flags: needinfo?(jcoppeard)
Component: DOM → DOM: Core & HTML

Bulk change for all regression bugs with status-firefox67 as 'fix-optional' to be marked 'affected' for status-firefox68.

Seems to be approximately the same crash rate as several months ago.
bp-604460dd-d91c-4943-a0f6-677fd0190602 is my crash from 67.0.
