Closed Bug 634542 Opened 9 years ago Closed 9 years ago
Spike in crashes [@ call
_trace ] among users with Firebug installed + [@ JSStack Frame::call Obj() ]
It is a low volume crash, but from 4.0b12pre/20110215, there is a spike in crashes. It is #4 top crasher in this build. Signature call_trace UUID d9f5abe0-92a7-4bac-9855-124262110216 Time 2011-02-16 00:23:46.49567 Uptime 1628 Last Crash 1658 seconds (27.6 minutes) before submission Install Age 1628 seconds (27.1 minutes) since version was first installed. Product Firefox Version 4.0b12pre Build ID 20110215030353 Branch 2.0 OS Mac OS X OS Version 10.6.4 10F569 CPU amd64 CPU Info family 6 model 15 stepping 10 Crash Reason EXC_BAD_ACCESS / KERN_INVALID_ADDRESS Crash Address 0x6a App Notes Renderers: 0x22600,0x20400 Frame Module Signature [Expand] Source 0 XUL call_trace js/src/jsvalue.h:1199 1 XUL js_TraceObject js/src/jsobj.cpp:6507 2 XUL js::gc::MarkChildren js/src/jsgcinlines.h:289 3 XUL js::MarkRuntime js/src/jsgcinlines.h:347 4 XUL js_GC js/src/jsgc.cpp:2407 5 XUL nsXPConnect::Collect js/src/xpconnect/src/nsXPConnect.cpp:407 6 XUL nsXPConnect::GarbageCollect js/src/xpconnect/src/nsXPConnect.cpp:415 7 XUL nsJSContext::CC dom/base/nsJSEnvironment.cpp:3400 8 XUL nsJSContext::MaybeCC dom/base/nsJSEnvironment.cpp:3490 9 XUL GCTimerFired dom/base/nsJSEnvironment.cpp:3469 10 XUL nsTimerImpl::Fire xpcom/threads/nsTimerImpl.cpp:425 11 XUL nsTimerEvent::Run xpcom/threads/nsTimerImpl.cpp:517 12 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:633 13 XUL NS_ProcessPendingEvents_P nsThreadUtils.cpp:200 14 XUL nsBaseAppShell::NativeEventCallback widget/src/xpwidgets/nsBaseAppShell.cpp:132 15 XUL nsAppShell::ProcessGeckoEvents widget/src/cocoa/nsAppShell.mm:399 16 CoreFoundation CoreFoundation@0x4de90 17 CoreFoundation CoreFoundation@0x4c088 The regression window is: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=26421a3b68f3&tochange=2c646d10b9c7 More reports at: https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=4&range_unit=weeks&signature=call_trace
This falls under random GC marking crashes.
blocking2.0: ? → -
Having a single GC crash address rocket to #4 (almost tying #2) starting with a single nightly build seems causal.
This looks worrysome. Can we try to investigate the stacks? No need to block until we have STRs but early analysis would be good.
blocking2.0: - → ?
So far it is just a one-day spike: the volume on Feb 16 is low again. (There are plenty of crashes already today from the Feb 15 build.) I'd wait a day or two to see if it persists before bothering with it too much.
OK, will re-evaluate tomorrow.
> So far it is just a one-day spike It is #2 top crasher in the Feb 16 build. Until now, crash report numbers are the following ones: 2011021600 27 2011021500 51 2011021400 1 2011020300 8 2011020100 1 Comments say: "Was cmd + tab'ing from browser to text editor" "Just closed google wave." "this is cause in firefox 4 nightly and with Firebug a 1.7.ax11"
The 20 crashes I looked at all had Firebug installed.
Summary: Spike in crashes [@ call_trace ] → Spike in crashes [@ call_trace ] among users with Firebug installed
Yep, looks like all the users have either Firebug 1.7X.0a10 (released Feb. 11 12:58pm) or 1.7X.0a11 (released Feb. 16 2:12am), so it looks like the users seeing this are running the current Firebug alphas for Firefox 4.
Tentatively marking as a hardblocker since this looks like a "firebug makes me crash" scenario and its position in terms of overall crashiness might underestimate its importance.
blocking2.0: ? → final+
Summary: Spike in crashes [@ call_trace ] among users with Firebug installed → Spike in crashes [@ call_trace ]
Summary: Spike in crashes [@ call_trace ] → Spike in crashes [@ call_trace ] among users with Firebug installed
Bug 614347 landed at 1pm on 2011-02-16, and I have a (small) hope that it might help.
Steve and I analyzed this, and it turns out Luke came to the same conclusion independently. The crash is inside call_trace: uintN count = fp->script()->bindings.countArgsAndVars(); // fp->script() is a jsval tag, so we crash trying to dereference // |bindings|. So it seems that a call object has a private that points to a random jsval instead of the stack frame. This could happen if the function that owns the call object exits without clearing the call object's fp pointer. Then it would be fairly likely that the target of the fp pointer would get overwritten with stack jsvals, causing us to crash just like this. Given that all these seem happen with Firebug, we thought that it might have something to do with JS_EvaluateInStackFrame. EvalInStackFrame calls JS_GetScopeChain, which can create a call object on demand. In particular, if the current stack frame is an eval frame, then we would create a call object for it, but the script epilogue code doesn't clear call objects for non-strict eval frames. So that's the best guess right now for the exact cause. In any case, we strongly suspect that some function-return path skips clearing the call object's fp when it shouldn't. Next steps are to try to repro this condition with Firebug and/or land diagnostic patches to test for that condition in the field.
Assignee: general → sphink
Status: NEW → ASSIGNED
Unfortunately, the suspected bug in comment 11 wasn't a bug, so we still don't have a specific cause. It still seems likely, though, that an eval frame is being given its own call object, so this is a patch to try to catch that bug close to the source and crash informatively. Also, I added some code to take a copy of the JSStackFrame near the crash in call_trace to better diagnose crashes. I'll wait for green on try, of course, but asking for approval to push to m-c.
Comment on attachment 513325 [details] [diff] [review] not fix, find bugs Approving late-stage telemetry.
Attachment #513325 - Flags: approval2.0? → approval2.0+
Comment on attachment 513325 [details] [diff] [review] not fix, find bugs Confirmed that the memcpy is not optimized away in MSVC. I think I've done it just like this before *except* I didn't make the global symbol public, and it didn't work. So that's the key. (Another way to do it is to write your own memcpy using volatile ptr arguments--that works too.)
Attachment #513325 - Flags: review? → review+
Diagnosis landed: http://hg.mozilla.org/mozilla-central/rev/c1b04cd37871
Dunno if I'm searching correctly or not, but I don't see any crashes coming in with the js_GetCallObject or PutActivationObjects signatures. I see a dozen more call_trace crashes today from last night's build. I haven't figured out how to get the crash dump data yet (my LDAP bit is flipped, so that's not the problem), so I don't know if Luke's frame copy is showing up properly.
I was also just looking at this. Yeah, it seems none of the crash-asserts tripped. The frame copy did indeed work, but it indicates a clobbered stack frame (which we already knew for the case where we crash getting fp->script; the copy was hoping that we'd crash at 0xbad (with a non-trash frame)). I'm going to work on another telemetry patch.
FWIW, the 64-bit ones seem to be almost all near-null crashes (one at ~0LL, a bunch at 0x0, and a bunch more < 0xFF); the 32-bit ones are mostly within 0xFF of 0xffffffffffff0000 ... which I guess is Socorro for 0xffff0000?
Whiteboard: [hardblocker] → [hardblocker][needs telemetry]
This patch smuggles the originating creation site of the call object into the call object's flags bits. It also records some chars of the call object's function's fun->atom/script->filename. Lastly, it records whether this call object was passed up in some PutActivation objects call.
Comment on attachment 513672 [details] [diff] [review] not fix, find bugs 2 Gotta stop assuming :dmandelin works.
Attachment #513672 - Flags: review? → review?(dmandelin)
(In reply to comment #19) > Created attachment 513672 [details] [diff] [review] > not fix, find bugs 2 > > This patch smuggles the originating creation site of the call object into the > call object's flags bits. It also records some chars of the call object's > function's fun->atom/script->filename. Lastly, it records whether this call > object was passed up in some PutActivation objects call. Huh? The attached patch is the same as the previous, except it buffers the captured frame on both sides. Did you attach the right one?
Thanks for pointing that out Steve. Patch looking green on try; anyone feel free to jump in and r+ so I can get this in before the nightly.
Attachment #513699 - Flags: review?(dmandelin) → review+
Diagnostic 2: http://hg.mozilla.org/mozilla-central/rev/5dd4f10fb1fd
There have been some crashes with useful data (some now in JSStackFrame::callObj (called by call_trace, right before call_trace would have crashed)). The origin of the un-put Call object is JS_GetFrameCallObject, which means we are in the rare case that a call objects is created for a non-heavyweight function (which is exercised by firebug for breakpoints/watch exprs/etc). I scanned the relevant interp/mjit places for this case and nothing jumped out. For at least one crash, the call object was created for some unnamed function in this (beautified and not-containing-personal-info) Facebook script: http://pastebin.mozilla.org/1082776. I played around with Firebug on Facebook (breaking on this exact script) to no avail.
I keep hoping someone will pop up with a simple STR but, until then, the well isn't dry. This patch checks some new key asserts in opt builds and collects more data about the source of the call object.
Comment on attachment 513812 [details] [diff] [review] patch to backout previous patches > JS_ASSERT(fp->isFunctionFrame() && !fp->isEvalFrame()); >+ >+ if (fp->isEvalFrame()) >+ *((int *)0x1337) = 0; Is this a new diagnostic or part of the backout?
Attachment #513812 - Flags: review?(dvander) → review+
(In reply to comment #27) > Is this a new diagnostic or part of the backout? Oops, meant to back that out too, good catch.
> I keep hoping someone will pop up with a simple STR not much useful in crash meta data. Correlation to startup or time of session 42 total crashes for call_trace on 20110219-crashdata.csv 5 startup crashes inside 30 sec. 14 startup crashes inside 3 min. 7 repeated crashes inside 3 min. of last crash urls for testing 16 \N 2 jar:file:///C:/Program%20Files%20(x86)/Minefield/omni.jar!/chrome/browser/content/browser/aboutHome.xhtml 2 http://lmorchard.com/status/main/ostatussub 1 wyciwyg://144/http://talkgadget.google.com/talkgadge 1 jar:file:///C:/Program%20Files%20(x86)/Minefield/omni.jar!/chrome/browser/content/browser/aboutPrivateBrowsing.xhtml 1 https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http... 1 https://www.facebook.com/pagelet/generic.php/PersonalProfileWallTabPagelet?__a=3&ajaxpipe=1&data=%7B%22profile_id%22%3A516463201%2C%22q%22%3A%22lmorchard%22%2C%22filter%22%3A1%2C%22can_see_more%22%3Atrue%2C%22pager_load_count%22%3A1%7D 1 https://0-176.channel.facebook.com/iframe/11?r=https%3A%2F%2Fs-static.ak.facebook.com%2Frsrc.php%2Fv1%2Fyg%2Fr%2FPzAcNRMfs_y.js&r=https%3A%2F%2Fs-static.ak.facebook.com%2Frsrc.php%2Fv1%2Fy-%2Fp%2Fr%2FggG67ARnbnR.js&r=https%3A%2F%2Fs-static.ak.facebook.com 1 http://www.theonion.com/articles/area-dad-just-wants-computer-with-the-basics,19195/ 1 http://www.ligtv.com.tr/Default.aspx?r=1&hid=87601 1 http://www.jamesming.com/corkhub/index.php/home/vendors
this should probably block the next beta since it seems to be a significant and sustained regression since the last beta starting around 2011021503 or maybe just a bit before. call_trace date total breakdown by build crashes count build, count build, ... 20110210 20110211 2 1 4.0b92011011019, 1 4.0b112011020314, 20110212 1 4.0b92011011019 1 , 20110213 3 2 4.0b112011020314, 1 4.0b82010121417, 20110214 1 4.0b12pre2011021403 1 , 20110215 15 13 4.0b12pre2011021503, 2 4.0b112011020314, 20110216 61 37 4.0b12pre2011021503, 21 4.0b12pre2011021603, 2 4.0b112011020314, 1 4.0b102011012116, 20110217 41 23 4.0b12pre2011021603, 14 4.0b12pre2011021703, 2 4.0b12pre2011021503, 1 4.0b82010121417, 1 4.0b12pre2011021712, 20110218 49 19 4.0b12pre2011021803, 14 4.0b12pre2011021712, 9 4.0b12pre2011021703, 3 4.0b112011020314, 2 4.0b12pre2011021603, 1 4.0b92011011019, 1 4.0b12pre2011021503, 20110219 41 23 4.0b12pre2011021803, 9 4.0b12pre2011021903, 4 4.0b12pre2011021712, 3 4.0b12pre2011021703, 1 4.0b82010121417, 1 4.0b12pre2011021814,
> 2 http://lmorchard.com/status/main/ostatussub Les, are you seeing this crash often enough that you could try to hit it? If so, getting it in a debugger and/or a minidump/core dump would be probably be very helpful.
The JSStackFrame::callObj() crash signature caused by the diagnostic patch is currently #2 top crasher in 4.0b12pre.
STR: 1) Install Firebug 1.7a11 2) enable it for all sites 3) Open Google in tab 1 3) Open Google Docs in tab 2 4) Open Google Docs in tab 3 5) In tab 3, create a new presentation 6) Change the title, subtitle to something 7) Close tab 3 and immediately close tab 2 https://crash-stats.mozilla.com/report/index/bp-b3b5d55b-41bd-4b68-b820-1609e2110221
Btw, these STR may not be minimal. If you create a new presentation it opens in a new tab. And if I close the presentation I get a popup where I click "Leave page". This crashes every time, most often @call_trace, but sometimes @js_TraceObject: https://crash-stats.mozilla.com/report/index/bp-b2b49d12-6267-4ce0-a36f-e3d2a2110221 Let me know if you need more detailed steps or anything.
resetting the blocking flag so this gets reconsideration for beta blocking. also noting that the JSStackFrame::callObj.. signature has a slightly later spike profile where the bump was not seen until builds from feb 19. JSStackFrame::callObj.. date total breakdown by build crashes count build, count build, ... 20110215 6 2 4.0b112011020314, 1 4.0b92011011019, 1 4.0b82010121417, 1 4.0b12pre2011021403, 1 4.0b102011012116, 20110216 5 4 4.0b112011020314, 1 4.0b102011012116, 20110217 1 4.0b112011020314 1 , 20110218 1 4.0b112011020314 1 , 20110219 16 11 4.0b12pre2011021903, 5 4.0b112011020314, 20110220 52 26 4.0b12pre2011022003, 21 4.0b12pre2011021903, 5 4.0b112011020314, combined, the two signatures are pretty close to the top crash.
Summary: Spike in crashes [@ call_trace ] among users with Firebug installed → Spike in crashes [@ call_trace ] among users with Firebug installed + [@ JSStackFrame::callObj() ]
(In reply to comment #35) > Btw, these STR may not be minimal. If you create a new presentation it opens in > a new tab. And if I close the presentation I get a popup where I click "Leave > page". This crashes every time, most often @call_trace, but sometimes > @js_TraceObject: > https://crash-stats.mozilla.com/report/index/bp-b2b49d12-6267-4ce0-a36f-e3d2a2110221 > > Let me know if you need more detailed steps or anything. (In reply to comment #34) > STR: > > 1) Install Firebug 1.7a11 > 2) enable it for all sites > 3) Open Google in tab 1 > 3) Open Google Docs in tab 2 > 4) Open Google Docs in tab 3 > 5) In tab 3, create a new presentation > 6) Change the title, subtitle to something > 7) Close tab 3 and immediately close tab 2 > > https://crash-stats.mozilla.com/report/index/bp-b3b5d55b-41bd-4b68-b820-1609e2110221 Great, I was able to reproduce using these steps! Just to note that I have created new presentation in tab 4 and then closed 4, 3, 2 -> crash. https://crash-stats.mozilla.com/report/index/1da02646-4996-4b74-aa07-605462110221 Honza
blocking2.0: ? → ---
(In reply to comment #32) > > 2 http://lmorchard.com/status/main/ostatussub > > Les, are you seeing this crash often enough that you could try to hit it? If > so, getting it in a debugger and/or a minidump/core dump would be probably be > very helpful. I don't think I've gotten it in awhile. But, I'll take a look at getting Minefield running in a debugger - never done it before
Beautiful, can repro locally with a debug build.
Despite looking at the js patches in the regression range like 10 times, there it is: a missing call to PutActivationObjects on the new cleared-scope error path.
Backed out the diagnostics: http://hg.mozilla.org/mozilla-central/rev/2ebeb6e9f036 This patch fixes the reproducible crash in call_trace for me and should, based on the regression range, fix the top crashes. With the error->report->firebug->dangling-call-object mindset, I can see several other (highly unlikely) error paths that could cause the same bug (perhaps this is the low-volume call_trace crash history). Rather than requiring each error source PutActivationObjects (like this patch), I think instead we should refactor a bit so that PutActivationObjects is always called on the error path. Perhaps we can land/verify this fix tonight and do the bigger/better patch next?
Whiteboard: [hardblocker][needs telemetry] → [hardblocker][has patch]
Landed the quick fix; will file follow-up bug for the rest of the stuff in comment 41. http://hg.mozilla.org/mozilla-central/rev/73d941bd2ef2
Shouldn't it be a call to ScriptEpilogue()?
No, because ScriptPrologue hasn't been called yet for the frame. I think that's the root of this family of bugs: creating the call object is done before the prologue, but cleaned up by the epilogue.
This landed, so marking betaN+ for bookkeeping changes, but wondering why it isn't marked FIXED? Is that because the follow-up bug needs to be filed? Or was that the also already landed bug 635805?
blocking2.0: ? → betaN+
Whiteboard: [hardblocker][has patch] → [hardblocker][has patch][landed already?]
The bug is for a spike in nightly crashes so I was planning to watch crashes for a day to verify that the crashes went away / down to their original levels. I am fine with marking the bug FIXED now though if that's the protocol. Also, the followup was filed; forgot to link here: bug 635811.
I could not reproduce the bug using steps in comment #37 in the nightly Built from http://hg.mozilla.org/mozilla-central/rev/1da3405c74fd Honza
We can reopen if this didnt fix it.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Looks fixed; no call_trace crashes since 20110221030349.
Crash Signature: [@ call_trace ] [@ JSStackFrame::callObj() ]
You need to log in before you can comment on or make changes to this bug.