Closed Bug 50705 Opened 25 years ago Closed 25 years ago

root_points_to_gcArenaPool assertion / crashes due to leaked root name "timeout.expr"

Categories

(SeaMonkey :: General, defect, P1)

defect

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: sean, Assigned: danm.moz)

References

Details

(Keywords: crash, testcase, Whiteboard: [nsbeta3-][PDTP2] [rtm++])

Attachments

(6 files)

Error: The address passed to JS_AddNamedRoot currently holds an invalid jsval. This is usually caused by a missing call to JS_RemoveRoot. Root name is "timeout.expr". Assertion failure: root_points_to_gcArenaPool, at h:\moz\mozilla\js\src\jsgc.c:668 A callstack for this assertion is at: http://bugzilla.mozilla.org/showattachment.cgi?attach_id=13650 (which was attached to bug 43466) This assertion is occurring with a modified Beatnik eMix. Currently available eMixes do not run in Mozilla - will attempt to generate a simplified testcase.
Lucky jst prolly gets this one. I'll take a quick look too, and see if some obvious travis-ty in the code is to blame. /be
I get something quite like this when loading a simple page with two frames. It doesn't say anything about the gc, instead it gets a segmentation fault, but the backtrace sure looks the same. Here is a backtrace: #0 0x403320b9 in chunk_free (ar_ptr=0x41900010, p=0x5a500) at malloc.c:3094 #1 0x40331fba in __libc_free (mem=0x41df94a8) at malloc.c:3023 #2 0x401ffa65 in PR_Free (ptr=0x41df94a8) at prmem.c:66 #3 0x400f4af8 in nsMemoryImpl::Free (this=0x8065ba0, ptr=0x41df94a8) at nsMemoryImpl.cpp:143 #4 0x400f5260 in nsMemory::Free (ptr=0x41df94a8) at nsMemoryImpl.cpp:269 #5 0x400a7710 in nsStr::Free (aDest=@0x41df6010) at nsStr.cpp:685 #6 0x400a69e2 in nsStr::Destroy (aDest=@0x41df6010) at nsStr.cpp:89 #7 0x400a7d1a in nsCString::~nsCString (this=0x41df600c, __in_chrg=2) at nsString.cpp:137 #8 0x40f6589d in nsXBLBinding::~nsXBLBinding (this=0x41df6000, __in_chrg=3) at nsXBLBinding.cpp:367 #9 0x40f64fe7 in nsXBLBinding::Release (this=0x41df6000) at nsXBLBinding.cpp:265 #10 0x406c0cef in nsXULElement::SetDocument (this=0x41d70f70, aDocument=0x0, aDeep=1, aCompileEventHandlers=1) at ../../../dist/include/nsCOMPtr.h:489 #11 0x406c1837 in nsXULElement::SetDocument (this=0x41dcaf28, aDocument=0x0, aDeep=1, aCompileEventHandlers=1) at nsXULElement.cpp:2306 #12 0x40fd42dd in nsGenericElement::SetDocumentInChildrenOf (aContent=0x41dcacf4, aDocument=0x0, aCompileEventHandlers=1) at nsGenericElement.cpp:1237 #13 0x40fd5028 in nsGenericElement::SetDocument (this=0x41dcad04, aDocument=0x0, aDeep=1, aCompileEventHandlers=1) at nsGenericElement.cpp:1326 #14 0x40eee5b1 in nsXMLElement::SetDocument (this=0x41dcacf0, aDocument=0x0, aDeep=1, aCompileEventHandlers=1) at nsXMLElement.h:67 #15 0x40f6ce39 in nsXBLBinding::ChangeDocument (this=0x41d70e98, aOldDocument=0x8592bf8, aNewDocument=0x0) at nsXBLBinding.cpp:1159 #16 0x406c0ad6 in nsXULElement::SetDocument (this=0x41dcdd08, aDocument=0x0, aDeep=1, aCompileEventHandlers=1) at nsXULElement.cpp:2224 #17 0x40d13504 in nsGfxScrollFrame::SetDocumentForAnonymousContent (this=0x41df3a0c, aDocument=0x0, aDeep=1, aCompileEventHandlers=1) at nsGfxScrollFrame.cpp:461 #18 0x40f9d5bd in nsDocument::SetScriptGlobalObject (this=0x8592bf8, aScriptGlobalObject=0x0) at nsDocument.cpp:1716 #19 0x40faaefb in DocumentViewerImpl::~DocumentViewerImpl (this=0x8592f08, __in_chrg=3) at nsDocumentViewer.cpp:410 #20 0x40faab50 in DocumentViewerImpl::Release (this=0x8592f08) at nsDocumentViewer.cpp:348 #21 0x41637fbe in nsCOMPtr<nsIContentViewer>::assign_with_AddRef (this=0x41d86440, rawPtr=0x0) at ../../dist/include/nsCOMPtr.h:471 #22 0x41620990 in nsDocShell::SetupNewViewer (this=0x41d863b8, aNewViewer=0x41dd5dd0) at ../../dist/include/nsCOMPtr.h:583 #23 0x4162b50a in nsWebShell::SetupNewViewer (this=0x41d863b8, aViewer=0x41dd5dd0) at nsWebShell.cpp:386 #24 0x4161e121 in nsDocShell::Embed (this=0x41d863b8, aContentViewer=0x41dd5dd0, aCommand=0x41644b64 "", aExtraInfo=0x0) at nsDocShell.cpp:2393 #25 0x4162b76d in nsWebShell::Embed (this=0x41d863b8, aContentViewer=0x41dd5dd0, aCommand=0x41644b64 "", aExtraInfo=0x0) at nsWebShell.cpp:415 #26 0x4161f01c in nsDocShell::CreateContentViewer (this=0x41d863b8, aContentType=0xbffff674 "text/html", aOpenedChannel=0x41dcc5c0, aContentHandler=0xbffff6e4) at nsDocShell.cpp:2561 #27 0x41633994 in nsDSURIContentListener::DoContent (this=0x41d85b88, aContentType=0xbffff674 "text/html", aCommand=0, aWindowTarget=0x4012bddc "", aOpenedChannel=0x41dcc5c0, aContentHandler=0xbffff6e4, aAbortProcess=0xbffff62c) at nsDSURIContentListener.cpp:99 #28 0x4165ba32 in nsDocumentOpenInfo::DispatchContent (this=0x41928948, aChannel=0x41dcc5c0, aCtxt=0x0) at nsURILoader.cpp:359 #29 0x4165aff7 in nsDocumentOpenInfo::OnStartRequest (this=0x41928948, aChannel=0x41dcc5c0, aCtxt=0x0) at nsURILoader.cpp:233 #30 0x40875c32 in nsFileChannel::OnStartRequest (this=0x41dcc5c0, transportChannel=0x4190ea00, context=0x0) at nsFileChannel.cpp:633 #31 0x407f6791 in nsOnStartRequestEvent::HandleEvent (this=0x4191de98) at nsAsyncStreamListener.cpp:210 #32 0x407f5e76 in nsStreamListenerEvent::HandlePLEvent (aEvent=0x41daf8f8) at nsAsyncStreamListener.cpp:97 #33 0x400e7132 in PL_HandleEvent (self=0x41daf8f8) at plevent.c:587 #34 0x400e7036 in PL_ProcessPendingEvents (self=0x80b1178) at plevent.c:528 #35 0x400e8bb0 in nsEventQueueImpl::ProcessPendingEvents (this=0x80b1140) at nsEventQueue.cpp:356 #36 0x41309abf in event_processor_callback (data=0x80b1140, source=8, condition=GDK_INPUT_READ) at nsAppShell.cpp:158
someone else want to own this?
I'd be happy to take a look at it, but I've never noticed this happening and there's no testcase in this bug. The source for the simple page with two frames mentioned in the above stacktrace would be handy.
Depends on: 50990
danm: Sorry, my crash was due to some modifications in my tree that accidentaly modified some memory. I've fixed that now.
Adding to Alex's comment above: I've been talking offline with Sean, who tells me the bug can't be seen outside the context of a commercial product he's working on. It's an HTML/JS entity that's also tightly bound to a plugin, though he thinks the plugin isn't the part that's responsible for the problem mentioned in this bug. Still, this product is required to reproduce the problem and there are currently legal issues with seeing the product, so we're presently rather stymied on this bug.
This should be reassigned to sean, or someone else who is working on his project.
taking it until we get a testcase that doesn't require legal paperwork or someone will sign the paperwork (which Dan has said he will do when he gets back next week).
Assignee: asa → sean
reassigning back to danm. I've put together a testcase (no forms to sign) that is available at: http://www.mindspring.com/~sean_/mozilla/gcassert/ The testcase consists of 3 html pages. Steps to repro are on the index page. I'm able to repro using a build from the tip Sept 6.
Assignee: sean → danm
Keywords: beatnik, testcase
No longer depends on: 50990
cc jrgm
This no longer fits our profile for nsbeta3+ bugs, so ->future.
Target Milestone: --- → Future
The JS_ASSERT that occurs stops debug builds of mozilla dead in its tracks (no ignore option in the dialog, just ok to debug and cancel to stop the process). In a release build, will an invalid jsval result in a crash or a leak?
Too bad about JS_ASSERT, but I think the developer who hits it should debug it (we're all debugger-enabled, right?). We do not want any of this O(n^2) growth rate code enabled for non-DEBUG builds, it will slow the GC down (and I just fixed a separate O(bad) bug in the GC!). So they will probably crash on an invalid root, especially one in freed or recycled memory. That's the breaks. /be
I didn't mean to imply that the debug code should be enabled in a release build - just trying to figure out the implications for Beatnik content if the bug is not fixed. In light of that assessment, adding crash keyword.
Keywords: crash
Adding vidur and heikki to cc: list for insight on DOM object garbage collection, because that's what's causing the crash. Sean might be able to fix this if he can get a bit of Q&A.
Confirming that this does crash the browser in release builds using this morning's talkback build - there are about 5 talkbacks from me this evening and they should all reference bug 50705. Will continue to investigate.
nominating rtm
Keywords: rtm
Increasing priority to P1 and nominating as an nsbeta3/rtm crasher as we've now verified that: 1) this problem indeed causes crashes in release builds 2) this impacts the stability/usability of Beatnik, which is a high profile plug-in partner 3) high profile web sites such as http://www.beatnik.com/emix, http://digital.yahoo.com, http://www.mtv.com and artist websites all have emixes which may crash due to this bug Clearing Future milestone to -- to trigger re-evaluation based on this new data.
Keywords: nsbeta3
Priority: P3 → P1
Target Milestone: Future → ---
nsbeta3+, based on EKrock's assessment.
Whiteboard: [nsbeta3+]
Target Milestone: --- → M18
I bet this bug, bug 53094, and maybe even bug 53123 are all symptoms of the same bad change that's causing JSContexts (probably DOM ones) to be destroyed but not popped off the per-thread context stack maintained by XPConnect. And probably to be destroyed without a timeout cleanup. But I could be wrong, and these are separate bugs as well as symptoms. My two cents (for now); hope it stirs the pot. /be
I can confirm that JS_AddNamedRoot(cx, &timeout->expr, "timeout.expr") is being called one more time than ::JS_RemoveRoot(cx, &aTimeout->expr).
I can also confirm that there is a call to GlobalWindowImpl::DropTimeout in which both aContext and mContext are null - so JS_RemoveRoot is not called. And I've also received one other root_points_to_gcArenaPool assertion twice with the root name "nsGenericElement::mScriptObject".
Sean: what's the stack when DropTimeout is called with mContext null? Can you somehow trace the stack that destroyed mContext? This could be a dup of bug 53094. /be
Attached a trace of what I think are the calls leading to this problem. This is from a build based on a pull from the tip around noon. Line numbers will differ due to additional logging I've added. There is a timer set via js that eventually leads to the closing of a popup window. The window is closed during the servicing of the timer being fired. mContext is set to nsnull 3 times during the servicing of the timer. DropTimeout does not fail during the servicing of this particular timer event as the ref_count of the timer does not hit 0 - so no attempt is made to 'really' drop the timeout. The last trace in the attachment I think shows another timer being serviced on the closed window - this is the DropTimeout that fails.
PDT: Very bad, but not a ship stopper. Assuming emix isn't high frequency. Evidence that the bad root could cause even more damage could raise priority.
Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]
*** Bug 53509 has been marked as a duplicate of this bug. ***
Note that I marked Bug 53509 as a dup. The stack over there is different (and the bug owner might want to give it a look), but it is the same named root.
updating summary
Summary: root_points_to_gcArenaPool assertion in SetNewDocument → root_points_to_gcArenaPool assertion due to leaked root name "timeout.expr"
mass-adding rtm keyword to all open nsbeta3+ xptoolkit bugs
marking nsbeta3- per clayton's comments. already nominated for rtm.
Whiteboard: [nsbeta3+][PDTP2] → [nsbeta3-][PDTP2]
53628 is a dup of this, but it's a crasher. updating summary. danm / brendan: I see in mozilla/dom/src/base/nsGlobalWindow.cpp where we call JS_AddNamedRoot() and were we call JS_RemoveRoot(cx, &aTimeout->expr); any pointers or tips on how to debug / fix this? what can I do to help fix this? this crasher blocks #47521
Summary: root_points_to_gcArenaPool assertion due to leaked root name "timeout.expr" → root_points_to_gcArenaPool assertion / crashes due to leaked root name "timeout.expr"
*** Bug 53628 has been marked as a duplicate of this bug. ***
Opened bug 54121 for the root_points_to_gcArenaPool assertion with root name "nsGenericElement::mScriptObject" I mentioned 9/21 - have seen that a lot more in the last week.
Whiteboard: [nsbeta3-][PDTP2] → [nsbeta3-][PDTP2] [rtm need info]
Testcase in 9/6 post is blocked by bug 55275
Another trivial way of reproducing this crash is to go to www.fortunecity.com, which uses some plugin called comet, I think, and then navigate around a bit. Whoever tries to fix this, if they could check that site at well, I'd appreciate it.
Is this bug still a worthwhile candidate for fixing in N6? That'll probably require a simple/safe fix in the next day or so, or a clearer picture of how this affects MTBF.
I think the best way to get a clearer picture of how this affects mtbf is to fix it. As near as I can tell, it's a topcrasher, and as checkins for a few of the other topcrashers get checked in, it might even become *the* topcrasher. A lot of the js type crashes bounce around a lot and get marked dups, etc, because the stack traces are so similar. At least for this one we have an easily reproducible case, though I haven't heard anything about how hard it is to fix.
severity->critical
Severity: normal → critical
Whiteboard: [nsbeta3-][PDTP2] [rtm need info] → [nsbeta3-][PDTP2] [rtm+ need info]
Also crashes under Mac OS 9.0.4, Moz build 2000100720-M18. Stdlog from crash to follow.
updating platform/OS per last post Also note that bug 55881 and bug 55445 may be related.
OS: Windows 2000 → All
Hardware: PC → All
The crash is caused by a JS timeout (actually an Interval) that outlives the JS context of its owning window. It is finally deleted, but there being no context from which to unroot it, it was never unrooted. At some later time a JS GC would kick in and die. This is fairly well explained in the comments for a minimal, rtm-ready patch attached below.
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3-][PDTP2] [rtm+ need info] → [nsbeta3-][PDTP2] [rtm+ need info] patch awaiting review
I'm also attaching local copies of the source for sean's testcase. We'll need to revisit this problem because the solution posted just above is suboptimal, so I'd like to make sure we have the test case always available.
I'll try this patch out on the pr3 branch.
I ran with the attached patch - I still crash when I go to www.fortunecity.com and click on the music scene link. I was told that crash was a dup of this bug, so if you could try that url with your fix, I'd appreciate it.
Patch looks good to me, I'm unable to reproduce the fortunecity.com crash so I dunno if that's related or not. r=jst
The fortunecity crash happens pretty readily on my machine, with or without my fix for this bug. But I don't understand why it's rumoured to be a duplicate bug. The roots' names are different: this one is timeout.expr; fortunecity's is nsGenericElement::mScriptObject. Is there a separate bug for the fortunecity crash? It wants one. I don't see one. I'm removing that URL from this bug's summary.
bug 54121 is for nsGenericElement::mScriptObject
I think if you're gonna initialize rt = nsnull, you might as well test after you try the do_GetService and the GetRuntime calls that rt is not null, and if it is still null, bail (the roots must not matter, if the runtime is gone -- probably some odd shutdown case, if it can happen at all). Fix that, and a=brendan@mozilla.org /be
Alright, will do. Got a point there, after all.
Whiteboard: [nsbeta3-][PDTP2] [rtm+ need info] patch awaiting review → [nsbeta3-][PDTP2] [rtm+]
rtm++
Whiteboard: [nsbeta3-][PDTP2] [rtm+] → [nsbeta3-][PDTP2] [rtm++]
.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Oh right. I did have something to say. Bugzilla protests so much I tend to ignore him. Fixed on trunk and Netscape rtm branch.
FYI: Branch build 2000-10-16-09MN6, NT4. The crash in Mail that was mentioned in bug# 47521 and 53628 no longer occurs.
can someone please verify this? I'm not qualifed to do this...
you will have problems reproduce this since the start up home page has changed to a giant image map.
vrf'd
Status: RESOLVED → VERIFIED
Sean: how did you verify this? did you test on mac, window and linux? I think that the correct way to verify is to give an lxr url for the checkin.
vrfd on Win NT and Win2K, using an undistributable Beatnik eMix and using the attached testcase. checkin lxr url: http://bonsai.mozilla.org/cvsview2.cgi? diff_mode=context&whitespace_mode=show&file=nsGlobalWindow.cpp&root=/cvsroot&sub dir=mozilla/dom/src/base&command=DIFF_FRAMESET&rev1=1.354&rev2=1.355
Keywords: beatnik, rtmnsrtm
Product: Browser → Seamonkey
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: