Closed Bug 50705 Opened 24 years ago Closed 24 years ago

root_points_to_gcArenaPool assertion / crashes due to leaked root name "timeout.expr"

Categories

(SeaMonkey :: General, defect, P1)

defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: sean, Assigned: danm.moz)

References

Details

(Keywords: crash, testcase, Whiteboard: [nsbeta3-][PDTP2] [rtm++])

Attachments

(6 files)

Error: The address passed to JS_AddNamedRoot currently holds an invalid jsval.
   This is usually caused by a missing call to JS_RemoveRoot.
   Root name is "timeout.expr".
Assertion failure: root_points_to_gcArenaPool, at 
h:\moz\mozilla\js\src\jsgc.c:668

A callstack for this assertion is at: 
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=13650
(which was attached to bug 43466)

This assertion is occurring with a modified Beatnik eMix.  Currently available 
eMixes do not run in Mozilla - will attempt to generate a simplified testcase.
Lucky jst prolly gets this one.  I'll take a quick look too, and see if 
some obvious travis-ty in the code is to blame.

/be
I get something quite like this when loading a simple page with two frames. It
doesn't say anything about the gc, instead it gets a segmentation fault, but the
backtrace sure looks the same.

Here is a backtrace:
 #0  0x403320b9 in chunk_free (ar_ptr=0x41900010, p=0x5a500) at malloc.c:3094
#1  0x40331fba in __libc_free (mem=0x41df94a8) at malloc.c:3023
#2  0x401ffa65 in PR_Free (ptr=0x41df94a8) at prmem.c:66
#3  0x400f4af8 in nsMemoryImpl::Free (this=0x8065ba0, ptr=0x41df94a8) at
nsMemoryImpl.cpp:143
#4  0x400f5260 in nsMemory::Free (ptr=0x41df94a8) at nsMemoryImpl.cpp:269
#5  0x400a7710 in nsStr::Free (aDest=@0x41df6010) at nsStr.cpp:685
#6  0x400a69e2 in nsStr::Destroy (aDest=@0x41df6010) at nsStr.cpp:89
#7  0x400a7d1a in nsCString::~nsCString (this=0x41df600c, __in_chrg=2) at
nsString.cpp:137
#8  0x40f6589d in nsXBLBinding::~nsXBLBinding (this=0x41df6000, __in_chrg=3) at
nsXBLBinding.cpp:367
#9  0x40f64fe7 in nsXBLBinding::Release (this=0x41df6000) at
nsXBLBinding.cpp:265
#10 0x406c0cef in nsXULElement::SetDocument (this=0x41d70f70, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at ../../../dist/include/nsCOMPtr.h:489
#11 0x406c1837 in nsXULElement::SetDocument (this=0x41dcaf28, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsXULElement.cpp:2306
#12 0x40fd42dd in nsGenericElement::SetDocumentInChildrenOf
(aContent=0x41dcacf4, aDocument=0x0, aCompileEventHandlers=1) at
nsGenericElement.cpp:1237
#13 0x40fd5028 in nsGenericElement::SetDocument (this=0x41dcad04, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsGenericElement.cpp:1326
#14 0x40eee5b1 in nsXMLElement::SetDocument (this=0x41dcacf0, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsXMLElement.h:67
#15 0x40f6ce39 in nsXBLBinding::ChangeDocument (this=0x41d70e98,
aOldDocument=0x8592bf8, aNewDocument=0x0) at nsXBLBinding.cpp:1159
#16 0x406c0ad6 in nsXULElement::SetDocument (this=0x41dcdd08, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsXULElement.cpp:2224
#17 0x40d13504 in nsGfxScrollFrame::SetDocumentForAnonymousContent
(this=0x41df3a0c, aDocument=0x0, aDeep=1, aCompileEventHandlers=1)
    at nsGfxScrollFrame.cpp:461
#18 0x40f9d5bd in nsDocument::SetScriptGlobalObject (this=0x8592bf8,
aScriptGlobalObject=0x0) at nsDocument.cpp:1716
#19 0x40faaefb in DocumentViewerImpl::~DocumentViewerImpl (this=0x8592f08,
__in_chrg=3) at nsDocumentViewer.cpp:410
#20 0x40faab50 in DocumentViewerImpl::Release (this=0x8592f08) at
nsDocumentViewer.cpp:348
#21 0x41637fbe in nsCOMPtr<nsIContentViewer>::assign_with_AddRef
(this=0x41d86440, rawPtr=0x0) at ../../dist/include/nsCOMPtr.h:471
#22 0x41620990 in nsDocShell::SetupNewViewer (this=0x41d863b8,
aNewViewer=0x41dd5dd0) at ../../dist/include/nsCOMPtr.h:583
#23 0x4162b50a in nsWebShell::SetupNewViewer (this=0x41d863b8,
aViewer=0x41dd5dd0) at nsWebShell.cpp:386
#24 0x4161e121 in nsDocShell::Embed (this=0x41d863b8, aContentViewer=0x41dd5dd0,
aCommand=0x41644b64 "", aExtraInfo=0x0) at nsDocShell.cpp:2393
#25 0x4162b76d in nsWebShell::Embed (this=0x41d863b8, aContentViewer=0x41dd5dd0,
aCommand=0x41644b64 "", aExtraInfo=0x0) at nsWebShell.cpp:415
#26 0x4161f01c in nsDocShell::CreateContentViewer (this=0x41d863b8,
aContentType=0xbffff674 "text/html", aOpenedChannel=0x41dcc5c0, 
    aContentHandler=0xbffff6e4) at nsDocShell.cpp:2561
#27 0x41633994 in nsDSURIContentListener::DoContent (this=0x41d85b88,
aContentType=0xbffff674 "text/html", aCommand=0, aWindowTarget=0x4012bddc "", 
    aOpenedChannel=0x41dcc5c0, aContentHandler=0xbffff6e4,
aAbortProcess=0xbffff62c) at nsDSURIContentListener.cpp:99
#28 0x4165ba32 in nsDocumentOpenInfo::DispatchContent (this=0x41928948,
aChannel=0x41dcc5c0, aCtxt=0x0) at nsURILoader.cpp:359
#29 0x4165aff7 in nsDocumentOpenInfo::OnStartRequest (this=0x41928948,
aChannel=0x41dcc5c0, aCtxt=0x0) at nsURILoader.cpp:233
#30 0x40875c32 in nsFileChannel::OnStartRequest (this=0x41dcc5c0,
transportChannel=0x4190ea00, context=0x0) at nsFileChannel.cpp:633
#31 0x407f6791 in nsOnStartRequestEvent::HandleEvent (this=0x4191de98) at
nsAsyncStreamListener.cpp:210
#32 0x407f5e76 in nsStreamListenerEvent::HandlePLEvent (aEvent=0x41daf8f8) at
nsAsyncStreamListener.cpp:97
#33 0x400e7132 in PL_HandleEvent (self=0x41daf8f8) at plevent.c:587
#34 0x400e7036 in PL_ProcessPendingEvents (self=0x80b1178) at plevent.c:528
#35 0x400e8bb0 in nsEventQueueImpl::ProcessPendingEvents (this=0x80b1140) at
nsEventQueue.cpp:356
#36 0x41309abf in event_processor_callback (data=0x80b1140, source=8,
condition=GDK_INPUT_READ) at nsAppShell.cpp:158
someone else want to own this?
I'd be happy to take a look at it, but I've never noticed this happening and 
there's no testcase in this bug. The source for the simple page with two frames 
mentioned in the above stacktrace would be handy.
Depends on: 50990
danm: Sorry, my crash was due to some modifications in my tree that accidentaly
modified some memory. I've fixed that now.
  Adding to Alex's comment above: I've been talking offline with Sean, who tells 
me the bug can't be seen outside the context of a commercial product he's working 
on. It's an HTML/JS entity that's also tightly bound to a plugin, though he 
thinks the plugin isn't the part that's responsible for the problem mentioned in 
this bug. Still, this product is required to reproduce the problem and there are 
currently legal issues with seeing the product, so we're presently rather stymied 
on this bug.
This should be reassigned to sean, or someone else who is working on his project.
taking it until we get a testcase that doesn't require legal paperwork or 
someone will sign the paperwork (which Dan has said he will do when he gets 
back next week).
Assignee: asa → sean
reassigning back to danm.

I've put together a testcase (no forms to sign) that is available at:
http://www.mindspring.com/~sean_/mozilla/gcassert/

The testcase consists of 3 html pages.  Steps to repro are on the index page.
I'm able to repro using a build from the tip Sept 6.
Assignee: sean → danm
Keywords: beatnik, testcase
No longer depends on: 50990
cc jrgm
This no longer fits our profile for nsbeta3+ bugs, so ->future.
Target Milestone: --- → Future
The JS_ASSERT that occurs stops debug builds of mozilla dead in its tracks (no 
ignore option in the dialog, just ok to debug and cancel to stop the process).

In a release build, will an invalid jsval result in a crash or a leak?

Too bad about JS_ASSERT, but I think the developer who hits it should debug it
(we're all debugger-enabled, right?).

We do not want any of this O(n^2) growth rate code enabled for non-DEBUG builds,
it will slow the GC down (and I just fixed a separate O(bad) bug in the GC!). So
they will probably crash on an invalid root, especially one in freed or recycled
memory.  That's the breaks.

/be
I didn't mean to imply that the debug code should be enabled in a release 
build - just trying to figure out the implications for Beatnik content if the 
bug is not fixed.

In light of that assessment, adding crash keyword.
Keywords: crash
Adding vidur and heikki to cc: list for insight on DOM object garbage 
collection, because that's what's causing the crash. Sean might be able to fix 
this if he can get a bit of Q&A. 
Confirming that this does crash the browser in release builds using this 
morning's talkback build - there are about 5 talkbacks from me this evening and 
they should all reference bug 50705.
Will continue to investigate.
nominating rtm
Keywords: rtm
Increasing priority to P1 and nominating as an nsbeta3/rtm crasher as we've now 
verified that:
1) this problem indeed causes crashes in release builds
2) this impacts the stability/usability of Beatnik, which is a high profile 
plug-in partner
3) high profile web sites such as http://www.beatnik.com/emix, 
http://digital.yahoo.com, http://www.mtv.com and artist websites all have emixes 
which may crash due to this bug

Clearing Future milestone to -- to trigger re-evaluation based on this new data.
Keywords: nsbeta3
Priority: P3 → P1
Target Milestone: Future → ---
nsbeta3+, based on EKrock's assessment.
Whiteboard: [nsbeta3+]
Target Milestone: --- → M18
I bet this bug, bug 53094, and maybe even bug 53123 are all symptoms of the same 
bad change that's causing JSContexts (probably DOM ones) to be destroyed but not 
popped off the per-thread context stack maintained by XPConnect.  And probably 
to be destroyed without a timeout cleanup.

But I could be wrong, and these are separate bugs as well as symptoms.  My two 
cents (for now); hope it stirs the pot.

/be
I can confirm that JS_AddNamedRoot(cx, &timeout->expr, "timeout.expr") is being 
called one more time than ::JS_RemoveRoot(cx, &aTimeout->expr).
I can also confirm that there is a call to GlobalWindowImpl::DropTimeout in 
which both aContext and mContext are null - so JS_RemoveRoot is not called.

And I've also received one other root_points_to_gcArenaPool assertion twice 
with the root name "nsGenericElement::mScriptObject".
Sean: what's the stack when DropTimeout is called with mContext null?  Can you 
somehow trace the stack that destroyed mContext?  This could be a dup of bug 
53094.

/be
Attached a trace of what I think are the calls leading to this problem.  This 
is from a build based on a pull from the tip around noon.  Line numbers will 
differ due to additional logging I've added.

There is a timer set via js that eventually leads to the closing of a popup 
window.  The window is closed during the servicing of the timer being fired.  
mContext is set to nsnull 3 times during the servicing of the timer.  
DropTimeout does not fail during the servicing of this particular timer event 
as the ref_count of the timer does not hit 0 - so no attempt is made 
to 'really' drop the timeout.

The last trace in the attachment I think shows another timer being serviced on 
the closed window - this is the DropTimeout that fails.

PDT:  Very bad, but not a ship stopper.  Assuming emix isn't high frequency.
Evidence that the bad root could cause even more damage could raise priority.
Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]
*** Bug 53509 has been marked as a duplicate of this bug. ***
Note that I marked Bug 53509 as a dup. The stack over there is different (and 
the bug owner might want to give it a look), but it is the same named root.
updating summary
Summary: root_points_to_gcArenaPool assertion in SetNewDocument → root_points_to_gcArenaPool assertion due to leaked root name "timeout.expr"
mass-adding rtm keyword to all open nsbeta3+ xptoolkit bugs
marking nsbeta3- per clayton's comments.  already nominated for rtm.
Whiteboard: [nsbeta3+][PDTP2] → [nsbeta3-][PDTP2]
53628 is a dup of this, but it's a crasher.

updating summary.

danm / brendan:  I see in mozilla/dom/src/base/nsGlobalWindow.cpp where we call
JS_AddNamedRoot() and were we call JS_RemoveRoot(cx, &aTimeout->expr);

any pointers or tips on how to debug / fix this? 

what can I do to help fix this?

this crasher blocks #47521

Summary: root_points_to_gcArenaPool assertion due to leaked root name "timeout.expr" → root_points_to_gcArenaPool assertion / crashes due to leaked root name "timeout.expr"
*** Bug 53628 has been marked as a duplicate of this bug. ***
Opened bug 54121 for the root_points_to_gcArenaPool assertion with root name
"nsGenericElement::mScriptObject" I mentioned 9/21 - have seen that a lot more
in the last week.
Whiteboard: [nsbeta3-][PDTP2] → [nsbeta3-][PDTP2] [rtm need info]
Testcase in 9/6 post is blocked by bug 55275
Another trivial way of reproducing this crash is to go to www.fortunecity.com, 
which uses some plugin called comet, I think, and then navigate around a bit. 
Whoever tries to fix this, if they could check that site at well, I'd appreciate 
it.
Is this bug still a worthwhile candidate for fixing in N6?  That'll probably
require a simple/safe fix in the next day or so, or a clearer picture of how
this affects MTBF.
I think the best way to get a clearer picture of how this affects mtbf is to fix
it. As near as I can tell, it's a topcrasher, and as checkins for a few of the
other topcrashers get checked in, it might even become *the* topcrasher. A lot
of the js type crashes bounce around a lot and get marked dups, etc, because the
stack traces are so similar. At least for this one we have an easily
reproducible case, though I haven't heard anything about how hard it is to fix.
severity->critical
Severity: normal → critical
Whiteboard: [nsbeta3-][PDTP2] [rtm need info] → [nsbeta3-][PDTP2] [rtm+ need info]
Also crashes under Mac OS 9.0.4, Moz build 2000100720-M18.

Stdlog from crash to follow.
updating platform/OS per last post

Also note that bug 55881 and bug 55445 may be related.
OS: Windows 2000 → All
Hardware: PC → All
The crash is caused by a JS timeout (actually an Interval) that outlives the JS 
context of its owning window. It is finally deleted, but there being no context 
from which to unroot it, it was never unrooted. At some later time a JS GC would 
kick in and die. This is fairly well explained in the comments for a minimal, 
rtm-ready patch attached below.
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3-][PDTP2] [rtm+ need info] → [nsbeta3-][PDTP2] [rtm+ need info] patch awaiting review
I'm also attaching local copies of the source for sean's testcase. We'll need to 
revisit this problem because the solution posted just above is suboptimal, so 
I'd like to make sure we have the test case always available.
I'll try this patch out on the pr3 branch.
I ran with the attached patch - I still crash when I go to www.fortunecity.com
and click on the music scene link. I was told that crash was a dup of this bug,
so if you could try that url with your fix, I'd appreciate it.
Patch looks good to me, I'm unable to reproduce the fortunecity.com crash so I
dunno if that's related or not. r=jst
  The fortunecity crash happens pretty readily on my machine, with or without my 
fix for this bug. But I don't understand why it's rumoured to be a duplicate bug. 
The roots' names are different: this one is timeout.expr; fortunecity's is 
nsGenericElement::mScriptObject.
  Is there a separate bug for the fortunecity crash? It wants one. I don't see 
one. I'm removing that URL from this bug's summary.
bug 54121 is for nsGenericElement::mScriptObject
I think if you're gonna initialize rt = nsnull, you might as well test after you
try the do_GetService and the GetRuntime calls that rt is not null, and if it is
still null, bail (the roots must not matter, if the runtime is gone -- probably
some odd shutdown case, if it can happen at all).

Fix that, and a=brendan@mozilla.org

/be
Alright, will do. Got a point there, after all.
Whiteboard: [nsbeta3-][PDTP2] [rtm+ need info] patch awaiting review → [nsbeta3-][PDTP2] [rtm+]
rtm++
Whiteboard: [nsbeta3-][PDTP2] [rtm+] → [nsbeta3-][PDTP2] [rtm++]
.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Oh right. I did have something to say. Bugzilla protests so much I tend to ignore 
him. Fixed on trunk and Netscape rtm branch.
FYI: Branch build 2000-10-16-09MN6, NT4. The crash in Mail that was mentioned in
bug# 47521 and 53628 no longer occurs.
can someone please verify this? I'm not qualifed to do this...
you will have problems reproduce this since the start up home page has changed
to a giant image map.
vrf'd
Status: RESOLVED → VERIFIED
Sean: how did you verify this? did you test on mac, window and linux?  I think 
that the correct way to verify is to give an lxr url for the checkin.
vrfd on Win NT and Win2K, using an undistributable Beatnik eMix and using the 
attached testcase.
checkin lxr url: http://bonsai.mozilla.org/cvsview2.cgi?
diff_mode=context&whitespace_mode=show&file=nsGlobalWindow.cpp&root=/cvsroot&sub
dir=mozilla/dom/src/base&command=DIFF_FRAMESET&rev1=1.354&rev2=1.355
Keywords: beatnik, rtmnsrtm
Product: Browser → Seamonkey
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: