crash in nsGlobalWindow::SetNewDocument

RESOLVED FIXED in mozilla20

Status

defect
--
critical
RESOLVED FIXED
7 years ago
5 months ago

People

(Reporter: scoobidiver, Unassigned)

Tracking

(4 keywords)

15 Branch
mozilla20

Firefox Tracking Flags

(firefox14 unaffected, firefox15+ wontfix, firefox16- affected, firefox17- affected)

It's #40 top browser crasher in 15.0b1.
It was a low volume crash across release builds but it first appeared for Nightly builds in 15.0a1/20120509.

It's slightly correlated to Crossrider Apps and Babylon:
* July 20:      36% (4/11) vs.   0% (4/3022) crossriderapp4639@crossrider.com
                45% (5/11) vs.   5% (145/3022) ffxtlbr@babylon.com
* July 21:      33% (7/21) vs.   3% (329/11823) crossriderapp2258@crossrider.com
* July 22:      18% (6/34) vs.   3% (791/27771) crossriderapp2258@crossrider.com
                12% (4/34) vs.   0% (4/27771) crossriderapp4982@crossrider.com
                24% (8/34) vs.   7% (2018/27771) ffxtlbr@babylon.com
* July 23:      28% (9/32) vs.   1% (140/27835) crossriderapp5060@crossrider.com
                22% (7/32) vs.   7% (2068/27835) ffxtlbr@babylon.com
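
For anyone reading the correlation lines above: as far as I understand the report format, each row compares the share of crashes with this signature that had the add-on against the share of all crashes that had it. A minimal sketch of that arithmetic follows; the struct, field names, and function names are made up for illustration and are not part of any crash-stats tooling.

```cpp
#include <cmath>
#include <string>

// Hypothetical row of a correlation report (names are illustrative only).
struct CorrelationRow {
    long signatureWithItem;  // crashes with this signature that had the add-on
    long signatureTotal;     // all crashes with this signature
    long overallWithItem;    // all crashes that had the add-on
    long overallTotal;       // all crashes
};

// Whole-number percentage, rounded to nearest; 0 when the total is empty.
inline long RoundedPercent(long part, long whole) {
    if (whole <= 0) {
        return 0;
    }
    return std::lround(100.0 * static_cast<double>(part) /
                       static_cast<double>(whole));
}

// Formats a row as "36% (4/11) vs. 0% (4/3022)".
inline std::string FormatRow(const CorrelationRow& r) {
    return std::to_string(RoundedPercent(r.signatureWithItem, r.signatureTotal)) +
           "% (" + std::to_string(r.signatureWithItem) + "/" +
           std::to_string(r.signatureTotal) + ") vs. " +
           std::to_string(RoundedPercent(r.overallWithItem, r.overallTotal)) +
           "% (" + std::to_string(r.overallWithItem) + "/" +
           std::to_string(r.overallTotal) + ")";
}
```

A large gap between the two percentages is what makes an add-on look suspicious; with samples as small as 11 crashes it is only a hint.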

Signature 	nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool)
UUID	3c207311-5e81-496b-bee5-2b0f72120722
Date Processed	2012-07-22 20:27:04
Uptime	2510
Last Crash	2.8 weeks before submission
Install Age	41.8 minutes since version was first installed.
Install Time	2012-07-22 19:45:04
Product	Firefox
Version	17.0a1
Build ID	20120722030555
Release Channel	nightly
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	AuthenticAMD family 16 model 4 stepping 3
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0x154
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x06cd, AdapterSubsysID: 115319da, AdapterDriverVersion: 9.18.13.448
D2D? D2D+ DWrite? DWrite+ D3D10 Layers? D3D10 Layers+ 
EMCheckCompatibility	True
Adapter Vendor ID	0x10de
Adapter Device ID	0x06cd
Total Virtual Memory	4294836224
Available Virtual Memory	3507171328
System Memory Use Percentage	26
Available Page File	14754906112
Available Physical Memory	6320115712

Frame 	Module 	Signature 	Source
0 	xul.dll 	nsGlobalWindow::SetNewDocument 	dom/base/nsGlobalWindow.cpp:1877
1 	xul.dll 	DocumentViewerImpl::InitInternal 	layout/base/nsDocumentViewer.cpp:926
2 	xul.dll 	DocumentViewerImpl::Close 	layout/base/nsDocumentViewer.cpp:1429
3 		@0xcb4638f 	
4 	xul.dll 	nsDocShell::Embed 	docshell/base/nsDocShell.cpp:5907
5 	xul.dll 	nsDocShell::CreateAboutBlankContentViewer 	docshell/base/nsDocShell.cpp:6643
6 	xul.dll 	nsDocShell::CreateAboutBlankContentViewer 	docshell/base/nsDocShell.cpp:6661
7 	xul.dll 	nsGlobalWindow::SetOpenerScriptPrincipal 	dom/base/nsGlobalWindow.cpp:1529
8 	xul.dll 	nsWindowWatcher::OpenWindowJSInternal 	embedding/components/windowwatcher/src/nsWindowWatcher.cpp:863
9 	xul.dll 	nsWindowWatcher::OpenWindow 	embedding/components/windowwatcher/src/nsWindowWatcher.cpp:381
10 	xul.dll 	NS_InvokeByIndex_P 	xpcom/reflect/xptcall/src/md/win32/xptcinvoke.cpp:70
11 	xul.dll 	XPCWrappedNative::CallMethod 	js/xpconnect/src/XPCWrappedNative.cpp:2382
12 	xul.dll 	XPC_WN_CallMethod 	js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1474
13 	mozjs.dll 	js::InvokeKernel 	js/src/jsinterp.cpp:345
14 	mozjs.dll 	js::Interpret 	js/src/jsinterp.cpp:2426
15 	mozjs.dll 	js::InvokeKernel 	js/src/jsinterp.cpp:356
16 	mozjs.dll 	js::Invoke 	js/src/jsinterp.cpp:388
17 	mozjs.dll 	JS_CallFunctionValue 	js/src/jsapi.cpp:5566
18 	xul.dll 	nsXPCWrappedJSClass::CallMethod 	js/xpconnect/src/XPCWrappedJSClass.cpp:1436
19 	xul.dll 	nsXPCWrappedJS::CallMethod 	js/xpconnect/src/XPCWrappedJS.cpp:580
20 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:85
21 	xul.dll 	SharedStub 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:112
22 	xul.dll 	DocumentViewerImpl::PermitUnload 	layout/base/nsDocumentViewer.cpp:1159

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsGlobalWindow%3A%3ASetNewDocument%28nsIDocument*%2C+nsISupports*%2C+bool%29
Duplicate of this bug: 777579
More reports also at:
https://crash-stats.mozilla.com/report/list?signature=nsGlobalWindow::SetNewDocument
Crash Signature: [@ nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool)] → [@ nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool)] [@ nsGlobalWindow::SetNewDocument]
OS: Windows XP → All
Hardware: x86 → All
If I recall correctly I think it was in the process of saving a draft copy of my wordpress blog.

I had a lot of other windows open as well, so it could have potentially been stuff running in the background.
Kyle, didn't you hack SetNewDocument recently?
The stacks are useless here.  Socorro is not linkifying them for some reason.
These crashes are happening on a line modified in Bug 730208, which landed for 15.

sfink is on vacation though ...
Tracking this regressing crasher for 15 since it started in 15, assigning to dmandelin to see about getting someone to help on this since sfink is on vacation.
Assignee: nobody → dmandelin
This looks like just an NPE, so here's a patch to check for null. I suspect this is not really the right thing to do, because the assertion a few lines up implies that currentInner == nullptr is not expected inside this if, but I also don't see anything above that would prevent that from happening. Bobby, do you think we should patch this null check, or would that just mask a bug in SetNewDocument or one of its callers?
Attachment #648779 - Flags: review?(bobbyholley+bmo)
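
For context on why this reads as a plain null dereference: a read fault at a small address such as 0x154 (the crash address in comment 0) is what you would typically see when a field is read through a null object pointer. The sketch below is illustrative only. `FakeInnerWindow` and its padding are invented so that the field lands at 0x154; this is not the real inner-window layout, and `ReadFlags` is a hypothetical helper showing the shape of a null guard.

```cpp
#include <cstddef>

// Hypothetical stand-in for the inner window. The padding is invented so
// that `flags` sits at offset 0x154, matching the reported crash address.
struct FakeInnerWindow {
    char padding[0x154];
    int flags;
};

// Address that a read of `flags` through a null pointer would touch.
constexpr std::size_t kNullReadAddress = offsetof(FakeInnerWindow, flags);

// Guarded access: returns false instead of reading through a null pointer.
inline bool ReadFlags(const FakeInnerWindow* currentInner, int* out) {
    if (!currentInner) {
        return false;
    }
    *out = currentInner->flags;
    return true;
}
```

The guard avoids the crash, but it does not explain how the pointer became null in the first place.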
Comment on attachment 648779 [details] [diff] [review]
Patch, just check for null

So, the NPE is occurring in the branch where we decided to reUseInnerWindow. The code paths taking us here don't pass aForceReuseInnerWindow AFAICT, so the fact that we're getting here means that WouldReuseInnerWindow() returned true. But this means that mDoc must be non-null.

So we're getting into a situation where we've got a non-null mDoc but a null mInnerWindow. This means that an earlier call to SetNewDocument probably did an exceptional early-return,

between here:
http://hg.mozilla.org/mozilla-central/file/3199bc043da4/dom/base/nsGlobalWindow.cpp#l1845
and here:
http://hg.mozilla.org/mozilla-central/file/3199bc043da4/dom/base/nsGlobalWindow.cpp#l1968

The most likely cause is that CreateNativeGlobalForInner failed. We assert against this, so I think it should be considered a bug until we understand it. CreateNativeGlobalForInner does a lot of stuff though, so I'm totally willing to believe that there's something fallible in there.

The wallpaper fix is to just check for a null mInnerWindow in WouldReuseInnerWindow (which is why I'm rminusing the attached patch). The correct fix involves determining why we're early-returning, which probably requires STR.
Attachment #648779 - Flags: review?(bobbyholley+bmo) → review-
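
To make the suspected state concrete, here is a toy model of the theory in the review comment above. It is a sketch under assumptions, not the Gecko code: `OuterWindowModel`, `failGlobalCreation`, and `WouldReuseInnerWindowChecked` are hypothetical names, the reuse check is reduced to the "mDoc is non-null" part, and the model assumes the document is stored before the fallible step.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins; these are not the real Gecko classes.
struct Document {};
struct InnerWindow {};

class OuterWindowModel {
public:
    // Stands in for CreateNativeGlobalForInner failing.
    bool failGlobalCreation = false;

    // Simplified: the real check looks at more than this, but it does
    // require a non-null mDoc, which is the part that matters here.
    bool WouldReuseInnerWindow() const { return mDoc != nullptr; }

    // The "wallpaper" variant: also require an inner window.
    bool WouldReuseInnerWindowChecked() const {
        return mDoc != nullptr && mInnerWindow != nullptr;
    }

    bool HasInnerWindow() const { return mInnerWindow != nullptr; }

    // Assumed ordering: the document is stored before the fallible step.
    bool SetNewDocument(std::shared_ptr<Document> doc) {
        mDoc = std::move(doc);
        if (failGlobalCreation) {
            return false;  // early return, mInnerWindow is never assigned
        }
        mInnerWindow = std::make_unique<InnerWindow>();
        return true;
    }

private:
    std::shared_ptr<Document> mDoc;
    std::unique_ptr<InnerWindow> mInnerWindow;
};
```

After one failed call the model says "reuse" while there is nothing to reuse, which is the combination the next call would trip over. The checked variant refuses in that state.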
Ms2ger also suggested nulling out mDoc if CreateNativeGlobalForInner fails, which might be more robust (but more nasty cleanup code - maybe an RAII class?)
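
A rough sketch of what such an RAII helper could look like, in case it helps the discussion. Everything here is hypothetical: `AutoResetOnFailure`, `SetDocument`, and the `createGlobalOk` flag (standing in for CreateNativeGlobalForInner succeeding) are made-up names, and `std::shared_ptr` is used only to keep the example self-contained.

```cpp
#include <memory>
#include <utility>

struct Document {};  // hypothetical stand-in

// Scope guard: resets the slot to null on scope exit unless Commit() ran.
template <typename T>
class AutoResetOnFailure {
public:
    explicit AutoResetOnFailure(std::shared_ptr<T>& slot) : mSlot(slot) {}
    ~AutoResetOnFailure() {
        if (!mCommitted) {
            mSlot.reset();
        }
    }
    void Commit() { mCommitted = true; }

    AutoResetOnFailure(const AutoResetOnFailure&) = delete;
    AutoResetOnFailure& operator=(const AutoResetOnFailure&) = delete;

private:
    std::shared_ptr<T>& mSlot;
    bool mCommitted = false;
};

// Toy setter: stores the document, then runs a step that may fail.
inline bool SetDocument(std::shared_ptr<Document>& docSlot,
                        std::shared_ptr<Document> newDoc,
                        bool createGlobalOk) {
    docSlot = std::move(newDoc);
    AutoResetOnFailure<Document> guard(docSlot);
    if (!createGlobalOk) {
        return false;  // guard clears docSlot on the way out
    }
    guard.Commit();
    return true;
}
```

The point of the guard is that every early return between the assignment and `Commit()` leaves the slot null without per-branch cleanup code.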

I'm starting to think this is related to bug 777875. They both appeared around the same time. I'm guessing somebody landed a patch somewhere that caused us to start tripping the assertion here in non-teardown cases:
http://hg.mozilla.org/mozilla-central/rev/712bca8b8674#l1.39

This would cause CreateNativeGlobalForInner to fail. Can we maybe get regression windows and see what everything is correlated with?
(In reply to Bobby Holley (:bholley) from comment #10)
> Can we maybe get regression windows and see what everything is correlated with?
Without STR, it will be hard as it's discontinuous across builds. It might be:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=1092c1a3ac50&tochange=dd29535bac5f
or
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e794cef56df6&tochange=642d1a36702f
Given that I'm guessing this is also causing the top-orange in bug 777875, flagging qawanted to get some STR. Not sure whether it's more productive to start from the random (but frequent) mochitest oranges, or from the crashes.
Keywords: qawanted
Thanks for the analysis. I think I'm not the best person to fix the actual bug.
Assignee: dmandelin → nobody
I wasn't able to find and install the Crossrider apps mentioned in comment 0, but I was able to install a handful of Crossrider demo apps from http://crossrider.com/developer/demo along with Babylon 9.0. Using Firefox 15.0b3 on Windows XP I've been unable to reproduce any crashes so far. Many of the comments mention that the crashes happen after resuming from overnight idle. Here are the following scenarios I tried with Firefox running and content loaded:

* Manually shutdown to standby and resume
* Set stand-by timeout to 60 seconds, wait for standby and resume
* Simulate a standby via power button and resume
* Set hibernate timeout to 120 seconds, wait for hibernation and resume

At least from the very minimal user scenarios I can think of, this bug is not reproducible. Bobby, I'm not sure how to QA this from a mochitest perspective. Some instruction would be appreciated.
With all the add-ons installed from before I tried opening each of the above URLs one by one in a new tab. Part way through Firefox crashed. I don't know whether the signature is the same because the report is still being processed. The ID is 925c0b89-e0ff-442d-8820-c33c82120809 just in case someone else wants to check. One thing I did notice was the crash happened when clicking OK on one of the dialogs (each crossrider add-on executes JS and prompts on tab load).

Again, I'm not yet certain the crash I saw was the same, and I've not yet re-encountered the crash.
Retried the same test as in comment 16 but this time having all the tabs loaded, entering stand-by mode, then resuming and switching around to different tabs; no crash reproduced.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #16)
> The ID is 925c0b89-e0ff-442d-8820-c33c82120809 just in case someone
> else wants to check.
It's indeed the same crash: bp-925c0b89-e0ff-442d-8820-c33c82120809.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #14)
> Bobby, I'm not sure how to QA this from a mochitest
> perspective. Some instruction would be appreciated.

Not sure. Does QA have any experience reproducing random oranges from tinderbox? I don't have any great ideas offhand.

Anyway, it sounds like you managed to at least reproduce the crash once, which is awesome! If we can hone in on that and get something reliable, I'll jump right in. Alternatively, if we get really stuck, we can just capitulate on this bug (and the random orange) and just add code to handle the failures that shouldn't be happening. I'd really like to avoid that though. :-(
Anthony, did you test this with a release build or a debug build? Using a debug build might help here. Bobby, could you add assertions that might help, maybe in a try build?
(In reply to Bill McCloskey (:billm) from comment #20)
> Anthony, did you test this with a release build or a debug build? Using a
> debug build might help here. Bobby, could you add assertions that might
> help, maybe in a try build?

If my theory is correct, we've already got the relevant assertion - the one being triggered in the random orange in bug 777875.
Philor suggests in bug 782167 comment 5 that reproducing might require triggering a tooltip.
Apologies for the delayed response. 

(In reply to Bill McCloskey (:billm) from comment #20)
> Anthony, did you test this with a release build or a debug build?

I was not using a debug build. I'm unable to get one working on Windows though. I've spent hours trying to get a Windows debug environment set up and it just won't work.

That said, I've not been able to re-reproduce this crash.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #23)
> I was not using a debug build. I'm unable to get one working on Windows
> though. I've spent hours trying to get a Windows debug environment set up
> and it just won't work.
> 
> That said, I've not been able to re-reproduce this crash.

I think it should be possible to install a debug build without needing a build environment. They're available here, for example:
  https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2012-08-14-mozilla-inbound-debug/

Although you might need an extra file from here:
  https://developer.mozilla.org/en-US/docs/Running_Windows_Debug_Builds
Maybe that's where you had the problem?

In general, doing testing with debug builds is usually a lot more valuable than testing with release builds. It's often a lot easier to reliably reproduce things.
(In reply to Bill McCloskey (:billm) from comment #24)
> Although you might need an extra file from here:
>   https://developer.mozilla.org/en-US/docs/Running_Windows_Debug_Builds
> Maybe that's where you had the problem?

This is the exact problem that I always seem to have, but installing the SDK as instructed via the link above does not resolve the issue for me. If you want to continue to help me troubleshoot my personal issues that would be great, but let's take this onto IRC or email.

Otherwise, I'm not sure that I can be much more help on this bug.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #25)
> This is the exact problem that I always seem to have but installing the SDK
> as instructed via the link above does not resolve the issue for me. If you
> want to continue to help me troubleshoot my personal issues that would be
> great, but let's take this onto IRC or email.

Kyle helpfully updated the docs at:
  https://developer.mozilla.org/en-US/docs/Running_Windows_Debug_Builds
It includes a new link to a file that should fix the problem.

I'm posting this here in case anyone else has this problem.
I've stumbled on a reliably reproducible crash condition but I can't tell if it's related to this bug or not.

Steps:
1. Install build from comment 24 in Windows XP
2. Install Babylon 9 Pro (including toolbar)
3. Install the Social Anywhere Crossrider App
> http://crossrider.com/install/519-social-anywhere
4. Restart Firefox and open Google+

Firefox eventually crashes if I let it sit there on Google+ for a minute. Allowing the session to restore on restart triggers the crash again.

Debug output shows an assertion:
###!!! ASSERTION: JSEventListener has wrong script context?: 'stack && NS_SUCCEEDED(stack->Peek(&cx)) && cx && GetScriptContextFromJSContext(cx) == mContext', file e:/builds/moz2_slave/m-in-w32-dbg/build/dom/src/events/nsJSEventListener.cpp, line 182

I received the following crash reports:
bp-243b8cd0-223d-4612-bb62-fdd5f2120816
bp-ce190b68-ebe6-4066-b18e-507522120816
bp-fe3f01d4-9161-4684-b903-1c8da2120816
bp-a7958e00-d207-4665-81f0-248e82120816
bp-a1f68248-03b5-4155-a8df-745ce2120816

This does not reproduce when using a non-debug build like Firefox 15.0b5.
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #27)
> I've stumbled on a reliably reproducible crash condition but I can't tell if
> it's related to this bug or not.

What indicates that it's related? I don't see a stack in the crash reports. Were you able to get a stack some other way?
I don't know that it's related at all, apart from the fact that it's a reproducible crash with Babylon toolbar and Crossrider Apps installed.
This crash is at #38 for FF15 and we're a day away from going to build on our final Beta so I'm wontfixing this for 15 and we can continue to investigate (and watch for more correlations) in 16, esp. once it goes to our Beta audience we might get some new leads here.
While it's unfortunate we have this regression, if we can't find STR soon, we'll likely untrack for FF16's release. This is not a top crasher.
A reliable set of steps to reproduce this bug still eludes me. At this point I don't see anything that QA can do. Please re-add qawanted if a new lead comes to light.
Keywords: qawanted → steps-wanted
It's correlated with malware in FF 15:
* DLLs:
     27% (67/246) vs.   1% (823/150321) DataMngrHlpFF15.dll (used in Bandoo/iMesh/Discordia's extensions)
     30% (73/246) vs.   4% (6469/150321) datamngr.dll (used in Bandoo/iMesh/Discordia's extensions)
* Extensions:
     27% (67/246) vs.   2% (2330/150321) {1FD91A9C-410C-4090-BBCC-55D3450EF433} (DataMngr)
     12% (29/246) vs.   2% (2813/150321) wtxpcom@mybrowserbar.com (Widgi Toolbar Platform)
     15% (36/246) vs.   5% (7333/150321) plugin@yontoo.com (Yontoo)
     13% (33/246) vs.   4% (5939/150321) ffxtlbr@funmoods.com (Funmoods)
     11% (27/246) vs.   2% (2711/150321) {BBDA0591-3099-440a-AA10-41764D9DB4DB} (Symantec IPS)
     11% (26/246) vs.   2% (2326/150321) {2D3F3651-74B9-4795-BDEC-6DA2F431CB62} (Norton Toolbar)
     11% (26/246) vs.   2% (2379/150321) {99079a25-328f-4bd4-be04-00955acaa0a7} (Searchqu Toolbar)
     15% (38/246) vs.   7% (10908/150321) ffxtlbr@babylon.com (Babylon Toolbar)
      9% (22/246) vs.   1% (1749/150321) ytd@mybrowserbar.com (YTD Toolbar)
I get frequent crashes in google docs presentation editor. Basically one crash every 15 minutes or so of work on creating a presentation. Most of the crashes look like the stack trace is wrong/impossible, just linux-gate.so@0x424 or such (at least I can't understand what that could mean), but now I got one that matches this bug. Perhaps using google docs presentation editor can help find STR (I have not seen a specific action though that causes it - looks like I could be clicking on any of their GUI elements, like changing font color etc., to trigger the crash).
linux-gate.so is a virtual shared library used for virtual system calls, FWIW:
http://www.trilithium.com/johan/2005/08/linux-gate/
Interesting, thanks. So what does it mean when I get frequent crashes in a site that have that signature? Is there some way to see which specific syscall it is? The stack trace above linux-gate (that is, what would presumably be doing the syscall) doesn't seem useful for some reason.
Do you have a link to one of these crashes?
It's #30 top browser crasher in 15.0.1.
Keywords: topcrash
Depends on: 795248
With combined signatures, it's #41 top browser crasher in 16.0.1.

It's now correlated to Babylon, as comment 27 mentions:
  nsGlobalWindow::SetNewDocument(nsIDocument*, nsISupports*, bool)|EXCEPTION_ACCESS_VIOLATION_READ (99 crashes)
     43% (43/99) vs.   2% (1473/92192) browsemngr-16.0.dll
         26% (26/99) vs.   1% (514/92192) 2.3.782.39
         17% (17/99) vs.   1% (959/92192) 2.3.787.43
     46% (46/99) vs.   5% (4416/92192) browsemngr.dll
          8% (8/99) vs.   0% (455/92192) 2.3.762.17
         18% (18/99) vs.   1% (930/92192) 2.3.765.24
         20% (20/99) vs.   3% (2560/92192) 2.3.787.43
          0% (0/99) vs.   0% (148/92192) 2.3.796.11
As written in Bug 803022 I have STR for the crash [1] but I can only reproduce it with a personal account on a vBulletin forum website. It may only be reproducible on Mac OS X and may only happen when using the touchpad swipe gesture (will check that again when I temporarily change the password). It happens with a clean(?) Fx 16.0.2 profile and Firebug installed (but if I recall correctly, it crashed without Firebug, too). 

Please tell me to whom I should send the credentials (via email) and STR (or check out Bug 803022 Comment 12) for the website to reproduce the crash. 

[1] https://crash-stats.mozilla.com/report/index/bp-197440a2-e017-42fd-ae06-f9b232121104
I am able to reproduce this crash 100% by loading http://m.whiskeymilitia.com/. Here is my crash report on Aurora using Mac 10.8: https://crash-stats.mozilla.com/report/index/bp-f375c33b-3e3a-4816-b00d-a14c72121108

Note that the crash doesn't happen right away so you have to be patient. I will point to this comment in the bug referenced in Comment 41 as well.
Keywords: reproducible
Confirmed on Fx 16.0.2 on Mac OS X 10.8.2 on MBA 2012. UI stalled with "pagead2.googlesyndication.com" (or similar) in the status bar. Took some time to crash.

https://crash-stats.mozilla.com/report/index/bp-01ba0fb6-4e7a-456e-be9b-770cf2121108
Oh yeah it was a mostly clean profile and the STR from Comment 42. I was only testing the stuff from bug 803022 for a couple of minutes, otherwise it's a fresh profile from like 30 minutes ago (no extensions, only QuickTime plugin). 

Bug 803022 did not turn out to be reproducible any more (concerning the crash).
Here's the stack I get on m.whiskeymilitia.com
So what's happening with whiskeymilitia is that the script is dispatching events in nested event handlers. Eventually it hits the recursion limit, but the exception doesn't propagate very far up, because HandleEventInternal squelches exceptions.

Eventually, the slow script dialog tries to take over, but then _it_ runs into trouble, because of the native stack limit (mccr8 had an idea to let privileged code run with a higher native stack limit, did that ever go anywhere?).

Anyway, I wasn't able to reproduce the SetNewDocument crash myself, but see how it could happen (GetCurrentInnerWindow() might be returning null). Let's see if this patch does the trick.
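
For readers following along, the pattern described above can be sketched as a toy model. This is not the Gecko event code: `DispatchEvent`, `Handler`, `RecursionLimitError`, and `kRecursionLimit` are hypothetical stand-ins, and the limit of 8 is arbitrary. The sketch only shows why the error raised at the recursion limit is swallowed by the nearest dispatch and never reaches the outermost caller.

```cpp
#include <stdexcept>

// Hypothetical error type standing in for the "too much recursion" exception.
struct RecursionLimitError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct DispatchStats {
    int deepest = 0;    // deepest nesting level reached
    int squelched = 0;  // errors swallowed by a dispatch
};

constexpr int kRecursionLimit = 8;  // arbitrary value for the model

void DispatchEvent(int depth, DispatchStats& stats);

// The handler dispatches another event, as the page's script does.
void Handler(int depth, DispatchStats& stats) {
    if (depth >= kRecursionLimit) {
        throw RecursionLimitError("too much recursion");
    }
    DispatchEvent(depth + 1, stats);
}

// Runs the handler and swallows its error, so outer frames never see it.
void DispatchEvent(int depth, DispatchStats& stats) {
    if (depth > stats.deepest) {
        stats.deepest = depth;
    }
    try {
        Handler(depth, stats);
    } catch (const RecursionLimitError&) {
        ++stats.squelched;
    }
}
```

In the model the innermost dispatch absorbs the error and every outer handler carries on as if nothing happened, so nothing stops the script from continuing to run close to the limit.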
Comment on attachment 683790 [details] [diff] [review]
Check for null currentInner when deciding to reuse inner windows. v1

Not far from dmandelin's patch ;)
Attachment #683790 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #49)
> Not far from dmandelin's patch ;)

Yes, but at least now we understand it a little better :-)

https://hg.mozilla.org/integration/mozilla-inbound/rev/62769304221f
> mccr8 had an idea to let privileged code run with a higher native stack limit, did that ever go anywhere?

This was Jesse's idea, not mine. :)  Bug 813646, just filed.
https://hg.mozilla.org/mozilla-central/rev/62769304221f
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Blocks: 803022
Component: DOM → DOM: Core & HTML