Closed Bug 634542 Opened 13 years ago Closed 13 years ago

Spike in crashes [@ call_trace ] among users with Firebug installed + [@ JSStackFrame::callObj() ]

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: scoobidiver, Assigned: luke)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [hardblocker][has patch][landed already?])

Crash Data

Attachments

(1 file, 5 obsolete files)

It is a low volume crash, but from 4.0b12pre/20110215, there is a spike in crashes.
It is #4 top crasher in this build.

Signature	call_trace
UUID	d9f5abe0-92a7-4bac-9855-124262110216
Time 	2011-02-16 00:23:46.49567
Uptime	1628
Last Crash	1658 seconds (27.6 minutes) before submission
Install Age	1628 seconds (27.1 minutes) since version was first installed.
Product	Firefox
Version	4.0b12pre
Build ID	20110215030353
Branch	2.0
OS	Mac OS X
OS Version	10.6.4 10F569
CPU	amd64
CPU Info	family 6 model 15 stepping 10
Crash Reason	EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash Address	0x6a
App Notes 	Renderers: 0x22600,0x20400

Frame 	Module 	Signature [Expand] 	Source
0 	XUL 	call_trace 	js/src/jsvalue.h:1199
1 	XUL 	js_TraceObject 	js/src/jsobj.cpp:6507
2 	XUL 	js::gc::MarkChildren 	js/src/jsgcinlines.h:289
3 	XUL 	js::MarkRuntime 	js/src/jsgcinlines.h:347
4 	XUL 	js_GC 	js/src/jsgc.cpp:2407
5 	XUL 	nsXPConnect::Collect 	js/src/xpconnect/src/nsXPConnect.cpp:407
6 	XUL 	nsXPConnect::GarbageCollect 	js/src/xpconnect/src/nsXPConnect.cpp:415
7 	XUL 	nsJSContext::CC 	dom/base/nsJSEnvironment.cpp:3400
8 	XUL 	nsJSContext::MaybeCC 	dom/base/nsJSEnvironment.cpp:3490
9 	XUL 	GCTimerFired 	dom/base/nsJSEnvironment.cpp:3469
10 	XUL 	nsTimerImpl::Fire 	xpcom/threads/nsTimerImpl.cpp:425
11 	XUL 	nsTimerEvent::Run 	xpcom/threads/nsTimerImpl.cpp:517
12 	XUL 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:633
13 	XUL 	NS_ProcessPendingEvents_P 	nsThreadUtils.cpp:200
14 	XUL 	nsBaseAppShell::NativeEventCallback 	widget/src/xpwidgets/nsBaseAppShell.cpp:132
15 	XUL 	nsAppShell::ProcessGeckoEvents 	widget/src/cocoa/nsAppShell.mm:399
16 	CoreFoundation 	CoreFoundation@0x4de90 	
17 	CoreFoundation 	CoreFoundation@0x4c088 	

The regression window is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=26421a3b68f3&tochange=2c646d10b9c7

More reports at:
https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=4&range_unit=weeks&signature=call_trace
blocking2.0: --- → ?
This falls under random GC marking crashes.
Blocks: 613650
blocking2.0: ? → -
Having a single GC crash address rocket to #4 (almost tying #2) starting with a single nightly build seems causal.
This looks worrysome. Can we try to investigate the stacks? No need to block until we have STRs but early analysis would be good.
blocking2.0: - → ?
So far it is just a one-day spike: the volume on Feb 16 is low again. (There are plenty of crashes already today from the Feb 15 build.) I'd wait a day or two to see if it persists before bothering with it too much.
OK, will re-evaluate tomorrow.
> So far it is just a one-day spike
It is #2 top crasher in the Feb 16 build. Until now, crash report numbers are the following ones:
2011021600	27
2011021500	51
2011021400	1
2011020300	8
2011020100	1

Comments say:
"Was cmd + tab'ing from browser to text editor"
"Just closed google wave."
"this is cause in firefox 4 nightly and with Firebug a 1.7.ax11"
The 20 crashes I looked at all had Firebug installed.
Summary: Spike in crashes [@ call_trace ] → Spike in crashes [@ call_trace ] among users with Firebug installed
Yep, looks like all the users have either Firebug 1.7X.0a10 (released Feb. 11 12:58pm) or 1.7X.0a11 (released Feb. 16 2:12am), so it looks like the users seeing this are running the current Firebug alphas for Firefox 4.
Tentatively marking as a hardblocker since this looks like a "firebug makes me crash" scenario and its position in terms of overall crashiness might underestimate its importance.
blocking2.0: ? → final+
Summary: Spike in crashes [@ call_trace ] among users with Firebug installed → Spike in crashes [@ call_trace ]
Whiteboard: [hardblocker]
Summary: Spike in crashes [@ call_trace ] → Spike in crashes [@ call_trace ] among users with Firebug installed
Bug 614347 landed at 1pm on 2011-02-16, and I have a (small) hope that it might help.
Steve and I analyzed this, and it turns out Luke came to the same conclusion independently. The crash is inside call_trace:

        uintN count = fp->script()->bindings.countArgsAndVars();
        // fp->script() is a jsval tag, so we crash trying to dereference
        // |bindings|.

So it seems that a call object has a private that points to a random jsval instead of the stack frame. This could happen if the function that owns the call object exits without clearing the call object's fp pointer. Then it would be fairly likely that the target of the fp pointer would get overwritten with stack jsvals, causing us to crash just like this.

Given that all these seem happen with Firebug, we thought that it might have something to do with JS_EvaluateInStackFrame. EvalInStackFrame calls JS_GetScopeChain, which can create a call object on demand. In particular, if the current stack frame is an eval frame, then we would create a call object for it, but the script epilogue code doesn't clear call objects for non-strict eval frames. So that's the best guess right now for the exact cause. In any case, we strongly suspect that some function-return path skips clearing the call object's fp when it shouldn't.

Next steps are to try to repro this condition with Firebug and/or land diagnostic patches to test for that condition in the field.
Assignee: general → sphink
Status: NEW → ASSIGNED
Attached patch not fix, find bugs (obsolete) — Splinter Review
Unfortunately, the suspected bug in comment 11 wasn't a bug, so we still don't have a specific cause.  It still seems likely, though, that an eval frame is being given its own call object, so this is a patch to try to catch that bug close to the source and crash informatively.  Also, I added some code to take a copy of the JSStackFrame near the crash in call_trace to better diagnose crashes.  I'll wait for green on try, of course, but asking for approval to push to m-c.
Attachment #513325 - Flags: review?
Attachment #513325 - Flags: approval2.0?
Comment on attachment 513325 [details] [diff] [review]
not fix, find bugs

Approving late-stage telemetry.
Attachment #513325 - Flags: approval2.0? → approval2.0+
Comment on attachment 513325 [details] [diff] [review]
not fix, find bugs

Confirmed that the memcpy is not optimized away in MSVC. I think I've done it just like this before *except* I didn't make the global symbol public, and it didn't work. So that's the key. (Another way to do it is to write your own memcpy using volatile ptr arguments--that works too.)
Attachment #513325 - Flags: review? → review+
Dunno if I'm searching correctly or not, but I don't see any crashes coming in with the js_GetCallObject or PutActivationObjects signatures. I see a dozen more call_trace crashes today from last night's build. I haven't figured out how to get the crash dump data yet (my LDAP bit is flipped, so that's not the problem), so I don't know if Luke's frame copy is showing up properly.
I was also just looking at this.  Yeah, it seems none of the crash-asserts tripped.  The frame copy did indeed work, but it indicates a clobbered stack frame (which we already knew for the case where we crash getting fp->script; the copy was hoping that we'd crash at 0xbad (with a non-trash frame)).

I'm going to work on another telemetry patch.
FWIW, the 64-bit ones seem to be almost all near-null crashes (one at ~0LL, a bunch at 0x0, and a bunch more < 0xFF); the 32-bit ones are mostly within 0xFF of 0xffffffffffff0000 ... which I guess is Socorro for 0xffff0000?
Whiteboard: [hardblocker] → [hardblocker][needs telemetry]
Attached patch not fix, find bugs 2 (obsolete) — Splinter Review
This patch smuggles the originating creation site of the call object into the call object's flags bits.  It also records some chars of the call object's function's fun->atom/script->filename.  Lastly, it records whether this call object was passed up in some PutActivation objects call.
Attachment #513325 - Attachment is obsolete: true
Attachment #513672 - Flags: review?
Comment on attachment 513672 [details] [diff] [review]
not fix, find bugs 2

Gotta stop assuming :dmandelin works.
Attachment #513672 - Flags: review? → review?(dmandelin)
(In reply to comment #19)
> Created attachment 513672 [details] [diff] [review]
> not fix, find bugs 2
> 
> This patch smuggles the originating creation site of the call object into the
> call object's flags bits.  It also records some chars of the call object's
> function's fun->atom/script->filename.  Lastly, it records whether this call
> object was passed up in some PutActivation objects call.

Huh? The attached patch is the same as the previous, except it buffers the captured frame on both sides. Did you attach the right one?
Attached patch not fix, find bugs 2 (obsolete) — Splinter Review
Thanks for pointing that out Steve.  Patch looking green on try; anyone feel free to jump in and r+ so I can get this in before the nightly.
Attachment #513672 - Attachment is obsolete: true
Attachment #513699 - Flags: review?(dmandelin)
Attachment #513672 - Flags: review?(dmandelin)
Attachment #513699 - Flags: review?(dmandelin) → review+
There have been some crashes with useful data (some now in JSStackFrame::callObj (called by call_trace, right before call_trace would have crashed)).

The origin of the un-put Call object is JS_GetFrameCallObject, which means we are in the rare case that a call objects is created for a non-heavyweight function (which is exercised by firebug for breakpoints/watch exprs/etc).  
I scanned the relevant interp/mjit places for this case and nothing jumped out.  For at least one crash, the call object was created for some unnamed function in this (beautified and not-containing-personal-info) Facebook script: http://pastebin.mozilla.org/1082776.  I played around with Firebug on Facebook (breaking on this exact script) to no avail.
Attachment #513699 - Attachment is obsolete: true
Attachment #513812 - Flags: review?(dvander)
Attached patch not fix, find bugs 3 (obsolete) — Splinter Review
I keep hoping someone will pop up with a simple STR but, until then, the well isn't dry.  This patch checks some new key asserts in opt builds and collects more data about the source of the call object.
Attachment #513813 - Flags: review?(dvander)
Comment on attachment 513812 [details] [diff] [review]
patch to backout previous patches


>     JS_ASSERT(fp->isFunctionFrame() && !fp->isEvalFrame());
>+ 
>+    if (fp->isEvalFrame())
>+        *((int *)0x1337) = 0;

Is this a new diagnostic or part of the backout?
Attachment #513812 - Flags: review?(dvander) → review+
(In reply to comment #27)
> Is this a new diagnostic or part of the backout?

Oops, meant to back that out too, good catch.
> I keep hoping someone will pop up with a simple STR

not much useful in crash meta data.

Correlation to startup or time of session
42 total crashes for call_trace on 20110219-crashdata.csv
5 startup crashes inside 30 sec.
14 startup crashes inside 3 min.
7 repeated crashes inside 3 min. of last crash


urls for testing
  16 \N
   2 jar:file:///C:/Program%20Files%20(x86)/Minefield/omni.jar!/chrome/browser/content/browser/aboutHome.xhtml
   2 http://lmorchard.com/status/main/ostatussub
   1 wyciwyg://144/http://talkgadget.google.com/talkgadge
   1 jar:file:///C:/Program%20Files%20(x86)/Minefield/omni.jar!/chrome/browser/content/browser/aboutPrivateBrowsing.xhtml
   1 https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http...
   1 https://www.facebook.com/pagelet/generic.php/PersonalProfileWallTabPagelet?__a=3&ajaxpipe=1&data=%7B%22profile_id%22%3A516463201%2C%22q%22%3A%22lmorchard%22%2C%22filter%22%3A1%2C%22can_see_more%22%3Atrue%2C%22pager_load_count%22%3A1%7D
   1 https://0-176.channel.facebook.com/iframe/11?r=https%3A%2F%2Fs-static.ak.facebook.com%2Frsrc.php%2Fv1%2Fyg%2Fr%2FPzAcNRMfs_y.js&r=https%3A%2F%2Fs-static.ak.facebook.com%2Frsrc.php%2Fv1%2Fy-%2Fp%2Fr%2FggG67ARnbnR.js&r=https%3A%2F%2Fs-static.ak.facebook.com
   1 http://www.theonion.com/articles/area-dad-just-wants-computer-with-the-basics,19195/
   1 http://www.ligtv.com.tr/Default.aspx?r=1&hid=87601
   1 http://www.jamesming.com/corkhub/index.php/home/vendors
this should probably block the next beta since it seems to be a significant and sustained regression since the last beta starting around 2011021503 or maybe just a bit before.

         call_trace

date     total    breakdown by build
         crashes  count build, count build, ...

20110210   
20110211 2  	1 4.0b92011011019, 
        		1 4.0b112011020314, 
20110212 1 4.0b92011011019 	1 , 
20110213 3  	2 4.0b112011020314, 
        		1 4.0b82010121417, 
20110214 1 4.0b12pre2011021403 	1 , 
20110215 15  	13 4.0b12pre2011021503, 
        		2 4.0b112011020314, 
20110216 61  	37 4.0b12pre2011021503, 
        		21 4.0b12pre2011021603, 	2 4.0b112011020314, 
        		1 4.0b102011012116, 
20110217 41  	23 4.0b12pre2011021603, 
        		14 4.0b12pre2011021703, 	2 4.0b12pre2011021503, 
        		1 4.0b82010121417, 	1 4.0b12pre2011021712, 
20110218 49  	19 4.0b12pre2011021803, 
        		14 4.0b12pre2011021712, 	9 4.0b12pre2011021703, 
        		3 4.0b112011020314, 	2 4.0b12pre2011021603, 
        		1 4.0b92011011019, 	1 4.0b12pre2011021503, 
20110219 41  	23 4.0b12pre2011021803, 
        		9 4.0b12pre2011021903, 	4 4.0b12pre2011021712, 
        		3 4.0b12pre2011021703, 	1 4.0b82010121417, 
        		1 4.0b12pre2011021814,
>    2 http://lmorchard.com/status/main/ostatussub

Les, are you seeing this crash often enough that you could try to hit it?  If so, getting it in a debugger and/or a minidump/core dump would be probably be very helpful.
The JSStackFrame::callObj() crash signature caused by the diagnostic patch is currently #2 top crasher in 4.0b12pre.
STR:

1) Install Firebug 1.7a11
2) enable it for all sites
3) Open Google in tab 1
3) Open Google Docs in tab 2
4) Open Google Docs in tab 3
5) In tab 3, create a new presentation
6) Change the title, subtitle to something
7) Close tab 3 and immediately close tab 2

https://crash-stats.mozilla.com/report/index/bp-b3b5d55b-41bd-4b68-b820-1609e2110221
Btw, these STR may not be minimal. If you create a new presentation it opens in a new tab. And if I close the presentation I get a popup where I click "Leave page". This crashes every time, most often @call_trace, but sometimes @js_TraceObject: https://crash-stats.mozilla.com/report/index/bp-b2b49d12-6267-4ce0-a36f-e3d2a2110221

Let me know if you need more detailed steps or anything.
resetting the blocking flag so this gets reconsideration for beta blocking.

also noting that the JSStackFrame::callObj..  signature has a slightly later spike profile where the bump was not seen until builds from feb 19.

          JSStackFrame::callObj..

date     total    breakdown by build
         crashes  count build, count build, ...

20110215 6  	2 4.0b112011020314, 
        		1 4.0b92011011019, 	1 4.0b82010121417, 
        		1 4.0b12pre2011021403, 	1 4.0b102011012116, 
20110216 5  	4 4.0b112011020314, 
        		1 4.0b102011012116, 
20110217 1 4.0b112011020314 	1 , 
20110218 1 4.0b112011020314 	1 , 
20110219 16  	11 4.0b12pre2011021903, 
        		5 4.0b112011020314, 
20110220 52  	26 4.0b12pre2011022003, 
        		21 4.0b12pre2011021903, 	5 4.0b112011020314, 

combined, the two signatures are pretty close to the top crash.
Summary: Spike in crashes [@ call_trace ] among users with Firebug installed → Spike in crashes [@ call_trace ] among users with Firebug installed + [@ JSStackFrame::callObj() ]
blocking2.0: final+ → ?
(In reply to comment #35)
> Btw, these STR may not be minimal. If you create a new presentation it opens in
> a new tab. And if I close the presentation I get a popup where I click "Leave
> page". This crashes every time, most often @call_trace, but sometimes
> @js_TraceObject:
> https://crash-stats.mozilla.com/report/index/bp-b2b49d12-6267-4ce0-a36f-e3d2a2110221
> 
> Let me know if you need more detailed steps or anything.

(In reply to comment #34)
> STR:
> 
> 1) Install Firebug 1.7a11
> 2) enable it for all sites
> 3) Open Google in tab 1
> 3) Open Google Docs in tab 2
> 4) Open Google Docs in tab 3
> 5) In tab 3, create a new presentation
> 6) Change the title, subtitle to something
> 7) Close tab 3 and immediately close tab 2
> 
> https://crash-stats.mozilla.com/report/index/bp-b3b5d55b-41bd-4b68-b820-1609e2110221
Great, I was able to reproduce using these steps! Just to note that I have created new presentation in tab 4 and then closed 4, 3, 2 -> crash.

https://crash-stats.mozilla.com/report/index/1da02646-4996-4b74-aa07-605462110221

Honza
blocking2.0: ? → ---
Keywords: topcrash
blocking2.0: --- → ?
(In reply to comment #32)
> >    2 http://lmorchard.com/status/main/ostatussub
> 
> Les, are you seeing this crash often enough that you could try to hit it?  If
> so, getting it in a debugger and/or a minidump/core dump would be probably be
> very helpful.

I don't think I've gotten it in awhile. But, I'll take a look at getting Minefield running in a debugger - never done it before
Beautiful, can repro locally with a debug build.
Despite looking at the js patches in the regression range like 10 times, there it is: a missing call to PutActivationObjects on the new cleared-scope error path.
Attached patch fixSplinter Review
Backed out the diagnostics:
http://hg.mozilla.org/mozilla-central/rev/2ebeb6e9f036

This patch fixes the reproducible crash in call_trace for me and should, based on the regression range, fix the top crashes.  

With the error->report->firebug->dangling-call-object mindset, I can see several other (highly unlikely) error paths that could cause the same bug (perhaps this is the low-volume call_trace crash history).  Rather than requiring each error source PutActivationObjects (like this patch), I think instead we should refactor a bit so that PutActivationObjects is always called on the error path.

Perhaps we can land/verify this fix tonight and do the bigger/better patch next?
Assignee: sphink → lw
Attachment #513812 - Attachment is obsolete: true
Attachment #513813 - Attachment is obsolete: true
Attachment #514066 - Flags: review?(gal)
Attachment #513813 - Flags: review?(dvander)
Whiteboard: [hardblocker][needs telemetry] → [hardblocker][has patch]
Attachment #514066 - Flags: review?(gal) → review+
Landed the quick fix; will file follow-up bug for the rest of the stuff in comment 41.
http://hg.mozilla.org/mozilla-central/rev/73d941bd2ef2
Shouldn't it be a call to ScriptEpilogue()?
No, because ScriptPrologue hasn't been called yet for the frame.  I think that's the root of this family of bugs: creating the call object is done before the prologue, but cleaned up by the epilogue.
Depends on: 635805
This landed, so marking betaN+ for bookkeeping changes, but wondering why it isn't marked FIXED? Is that because the follow-up bug needs to be filed? Or was that the also already landed bug 635805?
blocking2.0: ? → betaN+
Whiteboard: [hardblocker][has patch] → [hardblocker][has patch][landed already?]
The bug is for a spike in nightly crashes so I was planning to watch crashes for a day to verify that the crashes went away / down to their original levels.  I am fine with marking the bug FIXED now though if that's the protocol.

Also, the followup was filed; forgot to link here: bug 635811.
I could not reproduce the bug using steps in comment #37 in the nightly
Built from http://hg.mozilla.org/mozilla-central/rev/1da3405c74fd
Honza
We can reopen if this didnt fix it.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Looks fixed; no call_trace crashes since 20110221030349.
Crash Signature: [@ call_trace ] [@ JSStackFrame::callObj() ]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: