Closed Bug 1597965 Opened 5 years ago Closed 5 years ago

Crash in [@ nsWrapperCache::GetWrapper]

Categories

(Core :: DOM: Core & HTML, defect)

Type: defect
Priority: Not set
Severity: major

Tracking


RESOLVED FIXED
mozilla72
Tracking Status
firefox-esr68 --- unaffected
firefox70 --- unaffected
firefox71 --- unaffected
firefox72 blocking fixed

People

(Reporter: marcia, Assigned: emilio)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-c7fcf926-a6aa-4195-a35c-8c1af0191116.

Seen while looking at nightly crashes - started spiking in 72 in 20191114214957: https://mzl.la/2CYJf1Z. 165 crashes/47 installs so far. Noticeable spike in the build from 11/20. No particular pattern to the URLs, and the comments are not really useful.

Possible regression range based on build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2f19e7b646e0a52fa855b75c868f0a3f3a990ad3&tochange=88db9bea4580df16dc444668f8c2cddbb3414318

Top 7 frames of crashing thread:

0 xul.dll nsWrapperCache::GetWrapper dom/base/nsWrapperCacheInlines.h:27
1 xul.dll static bool mozilla::dom::Window_Binding::getWindowGlobalChild dom/bindings/WindowBinding.cpp:7409
2 xul.dll mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::MaybeCrossOriginObjectThisPolicy, mozilla::dom::binding_detail::ThrowExceptions> dom/bindings/BindingUtils.cpp:3154
3 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:548
4 xul.dll js::jit::DoCallFallback js/src/jit/BaselineIC.cpp:2940
5  @0x312b97b3f9e 
6 xul.dll trunc 

I have had this happen to me today and have had 4 crashes in the last 40 minutes. There are no definitive steps to reproduce for me.

Looking at that regression range, I think bz and emilio have modified code that is at least vaguely related to this. ni? to them to see if there's something obvious, or if something in that regression range jumps out to them that doesn't jump out to me.

Flags: needinfo?(emilio)
Flags: needinfo?(bzbarsky)

Noticeable spike in the build from 11/20.

That's because, as of the fix for bug 1371390, these crashes have started to become visible on the Mac :-)

I've just turned off updates on nightly based on the crash volume in the latest build.

(Following up comment 3)

Actually probably not. These crashes are visible on the Mac for the first time in the 20191120094758 mozilla-central nightly. But the spike on all platforms (currently 691) is much greater than the spike on the Mac (138).

Severity: normal → major

I'm having this problem with Firefox nightly 20191120094758 on Ubuntu Linux 19.04. I wasn't having this problem until I updated to the latest nightly.

I can reproduce it at will by opening any Google doc. As soon as the Google doc finishes loading, the tab crashes.

Here's one of my crash reports:

https://crash-stats.mozilla.org/report/index/b3f3dfc0-d2fc-4daf-be3d-9e8d40191120

Nightly updates are now rolled back to 20191119215132 and re-enabled.

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #6)

I'm having this problem with Firefox nightly 20191120094758 on Ubuntu Linux 19.04. I wasn't having this problem until I updated to the latest nightly.

I can reproduce it at will by opening any Google doc. As soon as the Google doc finishes loading, the tab crashes.

I couldn't repro on a simple Google doc (or at all so far) on a mozilla-central build. Do you have any add-ons or similar installed that could help me repro?

So far the stacks point to a null windowGlobalChild(), so it is more likely that some of the Fission changes triggered this than my patches or bz's.

I think this should be nullable: https://searchfox.org/mozilla-central/rev/652014ca1183c56bc5f04daf01af180d4e50a91c/dom/webidl/Window.webidl#552

But probably existing code didn't trigger it.

Looking a bit into it.

Flags: needinfo?(emilio)

It can be null after FreeInnerObjects.

Though it looks a bit spooky for chrome code to poke at a window in that state?
We should at least figure out which change made this such a high-volume crash.

This should unblock nightly updates for now... I can add an assertion when
called from DOM bindings if you want, so that we catch this on debug builds at
least?

Is that regression range correct? It seems to be from the 14th of November, so quite a while ago.

Flags: needinfo?(mozillamarcia.knous)

Bugs that have recently introduced calls to getWindowGlobalChild():

Of course the issue could be in other patches too... Bug 1577498 seems like it could be the culprit?

(In reply to Emilio Cobos Álvarez (:emilio) from comment #10)

Is that regression range correct? It seems from 14th of november and such. so quite a bit ago.

There were one-off crashes starting on 11/14, but the large spike that just presented itself is in 20191120094758. So that means it could be something that landed in https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=fdd07df83c87f12725f4b97c80e644fd11673977&tochange=79821df172391d2d9ab224951b36bd8856df0fb1 (changes between the 19th and 20th).

Flags: needinfo?(mozillamarcia.knous)

I think this should be nullable: https://searchfox.org/mozilla-central/rev/652014ca1183c56bc5f04daf01af180d4e50a91c/dom/webidl/Window.webidl#552

Very much so, given how it's handled in FreeInnerObjects.

Though it looks a bit spooky for chrome code to poke at a window in that state?

It's not really that hard. As an example, if you are poking a window from an iframe that was removed from the DOM, it's in that state.
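
As a rough illustration of that state, here is a hedged sketch in privileged/chrome JS (getWindowGlobalChild() is a chrome-only API, so this only applies to privileged callers poking at a content document):

    // Once the iframe is removed, its inner window gets torn down
    // (FreeInnerObjects runs) and there is no WindowGlobalChild left.
    const iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    const win = iframe.contentWindow;
    iframe.remove(); // docshell goes away, inner window is freed

    // Before this fix the generated binding assumed a non-null return and
    // crashed in nsWrapperCache::GetWrapper(); with a nullable return type
    // the same call simply yields null.
    const wgc = win.getWindowGlobalChild();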

I wish we got JS stacks from crash-stats; something like https://crash-stats.mozilla.org/report/index/b3f3dfc0-d2fc-4daf-be3d-9e8d40191120 would be legible with those. As it stands, what I can see is that we are firing DOMContentLoaded, which queues a Promise resolution or rejection, which then runs the Promise handler, which then calls PrecompiledScript::ExecuteInGlobal (and note that at this point, if the global is the window involved, it could already have had FreeInnerObjects called by some other DOMContentLoaded listener!), and then the chrome script pokes at getWindowGlobalChild().

Bug 1577498 seems like it could be the culprit?

Well, it's definitely in the regression range from comment 12.

So, we can definitely make getWindowGlobalChild() nullable, but then consumers need to check for null and deal with it, or at least check that they are OK with exceptions. Now luckily there just aren't that many consumers...
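
For illustration, a minimal sketch of that consumer-side pattern (withWindowGlobalChild is a hypothetical helper for this sketch, not an existing API):

    // With the getter nullable, privileged callers have to tolerate a
    // torn-down window instead of assuming a live WindowGlobalChild.
    function withWindowGlobalChild(win, callback) {
      const wgc = win.getWindowGlobalChild();
      if (!wgc) {
        // The window already ran FreeInnerObjects; bail out quietly.
        return null;
      }
      return callback(wgc);
    }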

Tomislav, can any of the consumers added in bug 1577498 run at a point after the window's docshell has been torn down, do you know?

Flags: needinfo?(bzbarsky) → needinfo?(tomica)

Yeah, this most definitely looks like it's getting triggered from bug 1577498. PrecompiledScript::ExecuteInGlobal is how we run content scripts, and often from behind a DOMContentLoaded listener.

I'll dig in some more, but I believe returning null in that case would be fine, and even more expected from my point of view.

Pushed by ealvarez@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c09a360d3bd3
Window.getWindowGlobalChild() should be nullable. r=bzbarsky

So yes, it turns out this is executing lazily after we're already in the content script, the first time it tries to access the browser API exposed inside the sandbox. Since we're already in extension code by that point, it's very possible that the docShell was very much alive when we decided to inject the content script and got killed in the meantime.

If getWindowGlobalChild() starts returning null at that point, the lazy getter will just throw, and likely abort the whole content script (or at least neuter it, since we'll never expose the api endpoint). This is almost certainly a better outcome than what would currently happen in that case (before bug 1577498).
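
A hedged sketch of the lazy-getter shape described above (illustrative names only: sandbox, contentWindow, and buildBrowserAPI stand in for the real extension-framework internals):

    // The "browser" endpoint is only materialized on first access from the
    // content script; by then the docShell may already be gone.
    Object.defineProperty(sandbox, "browser", {
      configurable: true,
      get() {
        const wgc = contentWindow.getWindowGlobalChild();
        if (!wgc) {
          // With the nullable getter this throws on first access, neutering
          // the content script instead of crashing the content process.
          throw new Error("window torn down before the API was accessed");
        }
        return buildBrowserAPI(wgc); // hypothetical helper
      },
    });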

(In reply to Michal Kluka from comment #14)

This page will always crash for me https://www.postoj.sk/48952/za-ciarou?fbclid=IwAR2vQ_E8gmBejfAr7zOfnUZy0THId26tkjFQgxCcQ_Xe1WY6Ln6alwSZMrw

Hey Michal, sorry for marking your comment "off-topic", it's too long and was making the discussion hard to follow, and I don't know how else to collapse it.

Flags: needinfo?(tomica)
See Also: → 1577498
Blocks: 1598129
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Assignee: nobody → emilio

That's really odd. That nightly seems to be built from rev 66531295716a76515be6e24774d788dc4ed8ecdf, which should have this fix...

As best I can tell, the patch for this bug (https://phabricator.services.mozilla.com/D53967) landed first in the 20191120215217 mozilla-central nightly, whose rev is e76dbab2aea8354660281221c1aa08356107881c:

https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=e76dbab2aea8354660281221c1aa08356107881c

Since it landed there have been a number of these crashes, though not nearly as many as with the 20191120094758 nightly:

https://crash-stats.mozilla.org/search/?build_id=%3E%3D20191120215217&signature=~nsWrapperCache%3A%3AGetWrapper&date=%3E%3D2019-11-20T16%3A26%3A00.000Z&date=%3C2019-11-21T16%3A26%3A00.000Z&_facets=signature&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Interestingly, the vast majority are on "6.1.7601 Service Pack 1". And only one isn't on Windows (it's on macOS 10.15.1).

There are a number of crashes with AsyncShutdownTimeout in the signature, most (all?) of which also (like this bug's crashes) have PromiseReactionJob() on the stack, and are also disproportionately on Windows. Possibly related?

https://crash-stats.mozilla.org/search/?signature=~AsyncShutdownTimeout&date=%3E%3D2019-11-14T18%3A23%3A00.000Z&date=%3C2019-11-21T18%3A23%3A00.000Z&_facets=signature&_facets=platform_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Pretty much all the crashes on builds after this fix have a null install date, so I'd take them with a grain of salt. Is it possible they're actually crashes from the broken build but we're making up some of the metadata on submission?

Flags: needinfo?(gsvelto)

(In reply to Julien Cristau [:jcristau] from comment #23)

Pretty much all the crashes on builds after this fix have a null install date, so I'd take them with a grain of salt. Is it possible they're actually crashes from the broken build but we're making up some of the metadata on submission?

Yeah, if they're missing the install time then they're orphaned crashes and the metadata is unreliable (we synthesize the bare minimum from the installation that's submitting them). Orphaned crashes are sent in batches when they're found, so that might be why you're seeing a spike; it should go away quickly.

Flags: needinfo?(gsvelto)

You're right, Gabriele. These crashes are completely gone as of the 20191123094742 trunk nightly.

Flags: qe-verify+

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?
Root Cause: ? → Coding: Logical Error