root_points_to_gcArenaPool assertion / crashes due to leaked root name "timeout.expr"

VERIFIED FIXED in M18

Status

P1
critical
VERIFIED FIXED
18 years ago
14 years ago

People

(Reporter: sean, Assigned: danm.moz)

Tracking

({crash, testcase})

Trunk
crash, testcase

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [nsbeta3-][PDTP2] [rtm++])

Attachments

(6 attachments)

(Reporter)

Description

18 years ago
Error: The address passed to JS_AddNamedRoot currently holds an invalid jsval.
   This is usually caused by a missing call to JS_RemoveRoot.
   Root name is "timeout.expr".
Assertion failure: root_points_to_gcArenaPool, at 
h:\moz\mozilla\js\src\jsgc.c:668

A callstack for this assertion is at: 
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=13650
(which was attached to bug 43466)

This assertion is occurring with a modified Beatnik eMix.  Currently available 
eMixes do not run in Mozilla - will attempt to generate a simplified testcase.
Lucky jst prolly gets this one.  I'll take a quick look too, and see if 
some obvious travis-ty in the code is to blame.

/be

Comment 2

18 years ago
I get something quite like this when loading a simple page with two frames. It
doesn't say anything about the gc, instead it gets a segmentation fault, but the
backtrace sure looks the same.

Here is a backtrace:
 #0  0x403320b9 in chunk_free (ar_ptr=0x41900010, p=0x5a500) at malloc.c:3094
#1  0x40331fba in __libc_free (mem=0x41df94a8) at malloc.c:3023
#2  0x401ffa65 in PR_Free (ptr=0x41df94a8) at prmem.c:66
#3  0x400f4af8 in nsMemoryImpl::Free (this=0x8065ba0, ptr=0x41df94a8) at
nsMemoryImpl.cpp:143
#4  0x400f5260 in nsMemory::Free (ptr=0x41df94a8) at nsMemoryImpl.cpp:269
#5  0x400a7710 in nsStr::Free (aDest=@0x41df6010) at nsStr.cpp:685
#6  0x400a69e2 in nsStr::Destroy (aDest=@0x41df6010) at nsStr.cpp:89
#7  0x400a7d1a in nsCString::~nsCString (this=0x41df600c, __in_chrg=2) at
nsString.cpp:137
#8  0x40f6589d in nsXBLBinding::~nsXBLBinding (this=0x41df6000, __in_chrg=3) at
nsXBLBinding.cpp:367
#9  0x40f64fe7 in nsXBLBinding::Release (this=0x41df6000) at
nsXBLBinding.cpp:265
#10 0x406c0cef in nsXULElement::SetDocument (this=0x41d70f70, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at ../../../dist/include/nsCOMPtr.h:489
#11 0x406c1837 in nsXULElement::SetDocument (this=0x41dcaf28, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsXULElement.cpp:2306
#12 0x40fd42dd in nsGenericElement::SetDocumentInChildrenOf
(aContent=0x41dcacf4, aDocument=0x0, aCompileEventHandlers=1) at
nsGenericElement.cpp:1237
#13 0x40fd5028 in nsGenericElement::SetDocument (this=0x41dcad04, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsGenericElement.cpp:1326
#14 0x40eee5b1 in nsXMLElement::SetDocument (this=0x41dcacf0, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsXMLElement.h:67
#15 0x40f6ce39 in nsXBLBinding::ChangeDocument (this=0x41d70e98,
aOldDocument=0x8592bf8, aNewDocument=0x0) at nsXBLBinding.cpp:1159
#16 0x406c0ad6 in nsXULElement::SetDocument (this=0x41dcdd08, aDocument=0x0,
aDeep=1, aCompileEventHandlers=1) at nsXULElement.cpp:2224
#17 0x40d13504 in nsGfxScrollFrame::SetDocumentForAnonymousContent
(this=0x41df3a0c, aDocument=0x0, aDeep=1, aCompileEventHandlers=1)
    at nsGfxScrollFrame.cpp:461
#18 0x40f9d5bd in nsDocument::SetScriptGlobalObject (this=0x8592bf8,
aScriptGlobalObject=0x0) at nsDocument.cpp:1716
#19 0x40faaefb in DocumentViewerImpl::~DocumentViewerImpl (this=0x8592f08,
__in_chrg=3) at nsDocumentViewer.cpp:410
#20 0x40faab50 in DocumentViewerImpl::Release (this=0x8592f08) at
nsDocumentViewer.cpp:348
#21 0x41637fbe in nsCOMPtr<nsIContentViewer>::assign_with_AddRef
(this=0x41d86440, rawPtr=0x0) at ../../dist/include/nsCOMPtr.h:471
#22 0x41620990 in nsDocShell::SetupNewViewer (this=0x41d863b8,
aNewViewer=0x41dd5dd0) at ../../dist/include/nsCOMPtr.h:583
#23 0x4162b50a in nsWebShell::SetupNewViewer (this=0x41d863b8,
aViewer=0x41dd5dd0) at nsWebShell.cpp:386
#24 0x4161e121 in nsDocShell::Embed (this=0x41d863b8, aContentViewer=0x41dd5dd0,
aCommand=0x41644b64 "", aExtraInfo=0x0) at nsDocShell.cpp:2393
#25 0x4162b76d in nsWebShell::Embed (this=0x41d863b8, aContentViewer=0x41dd5dd0,
aCommand=0x41644b64 "", aExtraInfo=0x0) at nsWebShell.cpp:415
#26 0x4161f01c in nsDocShell::CreateContentViewer (this=0x41d863b8,
aContentType=0xbffff674 "text/html", aOpenedChannel=0x41dcc5c0, 
    aContentHandler=0xbffff6e4) at nsDocShell.cpp:2561
#27 0x41633994 in nsDSURIContentListener::DoContent (this=0x41d85b88,
aContentType=0xbffff674 "text/html", aCommand=0, aWindowTarget=0x4012bddc "", 
    aOpenedChannel=0x41dcc5c0, aContentHandler=0xbffff6e4,
aAbortProcess=0xbffff62c) at nsDSURIContentListener.cpp:99
#28 0x4165ba32 in nsDocumentOpenInfo::DispatchContent (this=0x41928948,
aChannel=0x41dcc5c0, aCtxt=0x0) at nsURILoader.cpp:359
#29 0x4165aff7 in nsDocumentOpenInfo::OnStartRequest (this=0x41928948,
aChannel=0x41dcc5c0, aCtxt=0x0) at nsURILoader.cpp:233
#30 0x40875c32 in nsFileChannel::OnStartRequest (this=0x41dcc5c0,
transportChannel=0x4190ea00, context=0x0) at nsFileChannel.cpp:633
#31 0x407f6791 in nsOnStartRequestEvent::HandleEvent (this=0x4191de98) at
nsAsyncStreamListener.cpp:210
#32 0x407f5e76 in nsStreamListenerEvent::HandlePLEvent (aEvent=0x41daf8f8) at
nsAsyncStreamListener.cpp:97
#33 0x400e7132 in PL_HandleEvent (self=0x41daf8f8) at plevent.c:587
#34 0x400e7036 in PL_ProcessPendingEvents (self=0x80b1178) at plevent.c:528
#35 0x400e8bb0 in nsEventQueueImpl::ProcessPendingEvents (this=0x80b1140) at
nsEventQueue.cpp:356
#36 0x41309abf in event_processor_callback (data=0x80b1140, source=8,
condition=GDK_INPUT_READ) at nsAppShell.cpp:158

Comment 3

18 years ago
someone else want to own this?
(Assignee)

Comment 4

18 years ago
I'd be happy to take a look at it, but I've never noticed this happening and 
there's no testcase in this bug. The source for the simple page with two frames 
mentioned in the above stacktrace would be handy.
(Reporter)

Updated

18 years ago
Depends on: 50990

Comment 5

18 years ago
danm: Sorry, my crash was due to some modifications in my tree that accidentaly
modified some memory. I've fixed that now.
(Assignee)

Comment 6

18 years ago
  Adding to Alex's comment above: I've been talking offline with Sean, who tells 
me the bug can't be seen outside the context of a commercial product he's working 
on. It's an HTML/JS entity that's also tightly bound to a plugin, though he 
thinks the plugin isn't the part that's responsible for the problem mentioned in 
this bug. Still, this product is required to reproduce the problem and there are 
currently legal issues with seeing the product, so we're presently rather stymied 
on this bug.

Comment 7

18 years ago
This should be reassigned to sean, or someone else who is working on his project.
(Reporter)

Comment 8

18 years ago
taking it until we get a testcase that doesn't require legal paperwork or 
someone will sign the paperwork (which Dan has said he will do when he gets 
back next week).
Assignee: asa → sean
(Reporter)

Comment 9

18 years ago
reassigning back to danm.

I've put together a testcase (no forms to sign) that is available at:
http://www.mindspring.com/~sean_/mozilla/gcassert/

The testcase consists of 3 html pages.  Steps to repro are on the index page.
I'm able to repro using a build from the tip Sept 6.
Assignee: sean → danm
Keywords: beatnik, testcase
(Reporter)

Updated

18 years ago
No longer depends on: 50990

Comment 10

18 years ago
cc jrgm

Comment 11

18 years ago
This no longer fits our profile for nsbeta3+ bugs, so ->future.
Target Milestone: --- → Future
(Reporter)

Comment 12

18 years ago
The JS_ASSERT that occurs stops debug builds of mozilla dead in its tracks (no 
ignore option in the dialog, just ok to debug and cancel to stop the process).

In a release build, will an invalid jsval result in a crash or a leak?

Too bad about JS_ASSERT, but I think the developer who hits it should debug it
(we're all debugger-enabled, right?).

We do not want any of this O(n^2) growth rate code enabled for non-DEBUG builds,
it will slow the GC down (and I just fixed a separate O(bad) bug in the GC!). So
they will probably crash on an invalid root, especially one in freed or recycled
memory.  That's the breaks.

/be
(Reporter)

Comment 14

18 years ago
I didn't mean to imply that the debug code should be enabled in a release 
build - just trying to figure out the implications for Beatnik content if the 
bug is not fixed.

In light of that assessment, adding crash keyword.
Keywords: crash
Adding vidur and heikki to cc: list for insight on DOM object garbage 
collection, because that's what's causing the crash. Sean might be able to fix 
this if he can get a bit of Q&A. 
(Reporter)

Comment 16

18 years ago
Confirming that this does crash the browser in release builds using this 
morning's talkback build - there are about 5 talkbacks from me this evening and 
they should all reference bug 50705.
Will continue to investigate.
(Reporter)

Comment 17

18 years ago
nominating rtm
Keywords: rtm
Increasing priority to P1 and nominating as an nsbeta3/rtm crasher as we've now 
verified that:
1) this problem indeed causes crashes in release builds
2) this impacts the stability/usability of Beatnik, which is a high profile 
plug-in partner
3) high profile web sites such as http://www.beatnik.com/emix, 
http://digital.yahoo.com, http://www.mtv.com and artist websites all have emixes 
which may crash due to this bug

Clearing Future milestone to -- to trigger re-evaluation based on this new data.
Keywords: nsbeta3
Priority: P3 → P1
Target Milestone: Future → ---

Comment 19

18 years ago
nsbeta3+, based on EKrock's assessment.
Whiteboard: [nsbeta3+]
Target Milestone: --- → M18
I bet this bug, bug 53094, and maybe even bug 53123 are all symptoms of the same 
bad change that's causing JSContexts (probably DOM ones) to be destroyed but not 
popped off the per-thread context stack maintained by XPConnect.  And probably 
to be destroyed without a timeout cleanup.

But I could be wrong, and these are separate bugs as well as symptoms.  My two 
cents (for now); hope it stirs the pot.

/be
(Reporter)

Comment 21

18 years ago
I can confirm that JS_AddNamedRoot(cx, &timeout->expr, "timeout.expr") is being 
called one more time than ::JS_RemoveRoot(cx, &aTimeout->expr).
(Reporter)

Comment 22

18 years ago
I can also confirm that there is a call to GlobalWindowImpl::DropTimeout in 
which both aContext and mContext are null - so JS_RemoveRoot is not called.

And I've also received one other root_points_to_gcArenaPool assertion twice 
with the root name "nsGenericElement::mScriptObject".
Sean: what's the stack when DropTimeout is called with mContext null?  Can you 
somehow trace the stack that destroyed mContext?  This could be a dup of bug 
53094.

/be
(Reporter)

Comment 24

18 years ago
Created attachment 15335 [details]
trace of the release of mContext
(Reporter)

Comment 25

18 years ago
Attached a trace of what I think are the calls leading to this problem.  This 
is from a build based on a pull from the tip around noon.  Line numbers will 
differ due to additional logging I've added.

There is a timer set via js that eventually leads to the closing of a popup 
window.  The window is closed during the servicing of the timer being fired.  
mContext is set to nsnull 3 times during the servicing of the timer.  
DropTimeout does not fail during the servicing of this particular timer event 
as the ref_count of the timer does not hit 0 - so no attempt is made 
to 'really' drop the timeout.

The last trace in the attachment I think shows another timer being serviced on 
the closed window - this is the DropTimeout that fails.

Comment 26

18 years ago
PDT:  Very bad, but not a ship stopper.  Assuming emix isn't high frequency.
Evidence that the bad root could cause even more damage could raise priority.
Whiteboard: [nsbeta3+] → [nsbeta3+][PDTP2]

Comment 27

18 years ago
*** Bug 53509 has been marked as a duplicate of this bug. ***

Comment 28

18 years ago
Note that I marked Bug 53509 as a dup. The stack over there is different (and 
the bug owner might want to give it a look), but it is the same named root.
(Reporter)

Comment 29

18 years ago
updating summary
Summary: root_points_to_gcArenaPool assertion in SetNewDocument → root_points_to_gcArenaPool assertion due to leaked root name "timeout.expr"

Comment 30

18 years ago
mass-adding rtm keyword to all open nsbeta3+ xptoolkit bugs

Comment 31

18 years ago
marking nsbeta3- per clayton's comments.  already nominated for rtm.
Whiteboard: [nsbeta3+][PDTP2] → [nsbeta3-][PDTP2]
53628 is a dup of this, but it's a crasher.

updating summary.

danm / brendan:  I see in mozilla/dom/src/base/nsGlobalWindow.cpp where we call
JS_AddNamedRoot() and were we call JS_RemoveRoot(cx, &aTimeout->expr);

any pointers or tips on how to debug / fix this? 

what can I do to help fix this?

this crasher blocks #47521

Summary: root_points_to_gcArenaPool assertion due to leaked root name "timeout.expr" → root_points_to_gcArenaPool assertion / crashes due to leaked root name "timeout.expr"
*** Bug 53628 has been marked as a duplicate of this bug. ***
(Reporter)

Comment 34

18 years ago
Opened bug 54121 for the root_points_to_gcArenaPool assertion with root name
"nsGenericElement::mScriptObject" I mentioned 9/21 - have seen that a lot more
in the last week.
(Assignee)

Updated

18 years ago
Whiteboard: [nsbeta3-][PDTP2] → [nsbeta3-][PDTP2] [rtm need info]
(Reporter)

Comment 35

18 years ago
Testcase in 9/6 post is blocked by bug 55275

Comment 36

18 years ago
Another trivial way of reproducing this crash is to go to www.fortunecity.com, 
which uses some plugin called comet, I think, and then navigate around a bit. 
Whoever tries to fix this, if they could check that site at well, I'd appreciate 
it.

Comment 37

18 years ago
Is this bug still a worthwhile candidate for fixing in N6?  That'll probably
require a simple/safe fix in the next day or so, or a clearer picture of how
this affects MTBF.

Comment 38

18 years ago
I think the best way to get a clearer picture of how this affects mtbf is to fix
it. As near as I can tell, it's a topcrasher, and as checkins for a few of the
other topcrashers get checked in, it might even become *the* topcrasher. A lot
of the js type crashes bounce around a lot and get marked dups, etc, because the
stack traces are so similar. At least for this one we have an easily
reproducible case, though I haven't heard anything about how hard it is to fix.

Comment 39

18 years ago
severity->critical
Severity: normal → critical
Whiteboard: [nsbeta3-][PDTP2] [rtm need info] → [nsbeta3-][PDTP2] [rtm+ need info]

Comment 40

18 years ago
Also crashes under Mac OS 9.0.4, Moz build 2000100720-M18.

Stdlog from crash to follow.

Comment 41

18 years ago
Created attachment 16848 [details]
stdlog from Macsbug of crash while trying testcase
(Reporter)

Comment 42

18 years ago
updating platform/OS per last post

Also note that bug 55881 and bug 55445 may be related.
OS: Windows 2000 → All
Hardware: PC → All
(Assignee)

Comment 43

18 years ago
The crash is caused by a JS timeout (actually an Interval) that outlives the JS 
context of its owning window. It is finally deleted, but there being no context 
from which to unroot it, it was never unrooted. At some later time a JS GC would 
kick in and die. This is fairly well explained in the comments for a minimal, 
rtm-ready patch attached below.
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3-][PDTP2] [rtm+ need info] → [nsbeta3-][PDTP2] [rtm+ need info] patch awaiting review
(Assignee)

Comment 44

18 years ago
Created attachment 16957 [details] [diff] [review]
patch to unroot timeouts even lacking a window context
(Assignee)

Comment 45

18 years ago
I'm also attaching local copies of the source for sean's testcase. We'll need to 
revisit this problem because the solution posted just above is suboptimal, so 
I'd like to make sure we have the test case always available.
(Assignee)

Comment 46

18 years ago
Created attachment 16958 [details]
main test case file 50705.html. references the next two files as well.
(Assignee)

Comment 47

18 years ago
Created attachment 16959 [details]
second test case file 50705-check.html
(Assignee)

Comment 48

18 years ago
Created attachment 16960 [details]
third and final test case file 50705-remix.html

Comment 49

18 years ago
I'll try this patch out on the pr3 branch.

Comment 50

18 years ago
I ran with the attached patch - I still crash when I go to www.fortunecity.com
and click on the music scene link. I was told that crash was a dup of this bug,
so if you could try that url with your fix, I'd appreciate it.
Patch looks good to me, I'm unable to reproduce the fortunecity.com crash so I
dunno if that's related or not. r=jst
(Assignee)

Comment 52

18 years ago
  The fortunecity crash happens pretty readily on my machine, with or without my 
fix for this bug. But I don't understand why it's rumoured to be a duplicate bug. 
The roots' names are different: this one is timeout.expr; fortunecity's is 
nsGenericElement::mScriptObject.
  Is there a separate bug for the fortunecity crash? It wants one. I don't see 
one. I'm removing that URL from this bug's summary.
(Reporter)

Comment 53

18 years ago
bug 54121 is for nsGenericElement::mScriptObject
I think if you're gonna initialize rt = nsnull, you might as well test after you
try the do_GetService and the GetRuntime calls that rt is not null, and if it is
still null, bail (the roots must not matter, if the runtime is gone -- probably
some odd shutdown case, if it can happen at all).

Fix that, and a=brendan@mozilla.org

/be
(Assignee)

Comment 55

18 years ago
Alright, will do. Got a point there, after all.
Whiteboard: [nsbeta3-][PDTP2] [rtm+ need info] patch awaiting review → [nsbeta3-][PDTP2] [rtm+]

Comment 56

18 years ago
rtm++
Whiteboard: [nsbeta3-][PDTP2] [rtm+] → [nsbeta3-][PDTP2] [rtm++]
(Assignee)

Comment 57

18 years ago
.
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED
(Assignee)

Comment 58

18 years ago
Oh right. I did have something to say. Bugzilla protests so much I tend to ignore 
him. Fixed on trunk and Netscape rtm branch.

Comment 59

18 years ago
FYI: Branch build 2000-10-16-09MN6, NT4. The crash in Mail that was mentioned in
bug# 47521 and 53628 no longer occurs.
can someone please verify this? I'm not qualifed to do this...
you will have problems reproduce this since the start up home page has changed
to a giant image map.
(Reporter)

Comment 62

18 years ago
vrf'd
Status: RESOLVED → VERIFIED

Comment 63

18 years ago
Sean: how did you verify this? did you test on mac, window and linux?  I think 
that the correct way to verify is to give an lxr url for the checkin.
(Reporter)

Comment 64

18 years ago
vrfd on Win NT and Win2K, using an undistributable Beatnik eMix and using the 
attached testcase.
checkin lxr url: http://bonsai.mozilla.org/cvsview2.cgi?
diff_mode=context&whitespace_mode=show&file=nsGlobalWindow.cpp&root=/cvsroot&sub
dir=mozilla/dom/src/base&command=DIFF_FRAMESET&rev1=1.354&rev2=1.355

Updated

17 years ago
Keywords: beatnik, rtm → nsrtm
Product: Browser → Seamonkey
You need to log in before you can comment on or make changes to this bug.