Closed
Bug 1034706
Opened 10 years ago
Closed 6 years ago
crash in js::jit::EnterBaselineMethod(JSContext*, js::RunState&)
Categories
(Core :: JavaScript Engine: JIT, defect, P3)
Tracking
RESOLVED
INCOMPLETE
| Flag | Tracking | Status |
|---|---|---|
| e10s | - | --- |
| firefox30 | --- | unaffected |
| firefox31 | - | wontfix |
| firefox32 | + | wontfix |
| firefox33 | + | wontfix |
| firefox34 | --- | wontfix |
| firefox35 | --- | wontfix |
| firefox39 | - | wontfix |
| firefox41 | --- | wontfix |
| firefox42 | --- | wontfix |
| firefox43 | --- | wontfix |
| firefox44 | --- | wontfix |
| firefox45 | --- | wontfix |
| firefox46 | + | wontfix |
| firefox47 | + | wontfix |
| firefox48 | --- | wontfix |
| firefox49 | --- | wontfix |
| firefox-esr45 | --- | wontfix |
| firefox50 | --- | wontfix |
| firefox51 | --- | wontfix |
| firefox52 | --- | wontfix |
People
(Reporter: u279076, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, Whiteboard: [native-crash])
Crash Data
This bug was filed from the Socorro interface and is report bp-23cc14e5-2e4f-4f96-ab95-2cf572140627.
=============================================================
0 mozjs.dll js::jit::EnterBaselineMethod(JSContext *,js::RunState &) js/src/jit/BaselineJIT.cpp
1 mozjs.dll Interpret js/src/vm/Interpreter.cpp
2 mozjs.dll js::RunScript(JSContext *,js::RunState &) js/src/vm/Interpreter.cpp
3 mozjs.dll js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) js/src/vm/Interpreter.cpp
4 mozjs.dll js_fun_apply(JSContext *,unsigned int,JS::Value *) js/src/jsfun.cpp
5 mozjs.dll js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) js/src/vm/Interpreter.cpp
6 mozjs.dll Interpret js/src/vm/Interpreter.cpp
7 mozjs.dll js::RunScript(JSContext *,js::RunState &) js/src/vm/Interpreter.cpp
8 mozjs.dll js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) js/src/vm/Interpreter.cpp
9 mozjs.dll js_fun_apply(JSContext *,unsigned int,JS::Value *) js/src/jsfun.cpp
10 mozjs.dll js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) js/src/vm/Interpreter.cpp
11 mozjs.dll Interpret js/src/vm/Interpreter.cpp
12 mozjs.dll js::RunScript(JSContext *,js::RunState &) js/src/vm/Interpreter.cpp
13 mozjs.dll js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) js/src/vm/Interpreter.cpp
14 mozjs.dll js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) js/src/vm/Interpreter.cpp
15 mozjs.dll JS::Call(JSContext *,JS::Handle<JS::Value>,JS::Handle<JS::Value>,JS::HandleValueArray const &,JS::MutableHandle<JS::Value>) js/src/jsapi.cpp
16 xul.dll mozilla::dom::EventListener::HandleEvent(JSContext *,JS::Handle<JS::Value>,mozilla::dom::Event &,mozilla::ErrorResult &) obj-firefox/dom/bindings/EventListenerBinding.cpp
17 xul.dll mozilla::dom::EventListener::HandleEvent<mozilla::dom::EventTarget *>(mozilla::dom::EventTarget * const &,mozilla::dom::Event &,mozilla::ErrorResult &,mozilla::dom::CallbackObject::ExceptionHandling) obj-firefox/dist/include/mozilla/dom/EventListenerBinding.h
18 xul.dll mozilla::EventListenerManager::HandleEventSubType(mozilla::EventListenerManager::Listener *,nsIDOMEvent *,mozilla::dom::EventTarget *) dom/events/EventListenerManager.cpp
19 xul.dll mozilla::EventTargetChainItem::HandleEventTargetChain(nsTArray<mozilla::EventTargetChainItem> &,mozilla::EventChainPostVisitor &,mozilla::EventDispatchingCallback *,mozilla::ELMCreationDetector &) dom/events/EventDispatcher.cpp
20 xul.dll mozilla::EventDispatcher::Dispatch(nsISupports *,nsPresContext *,mozilla::WidgetEvent *,nsIDOMEvent *,nsEventStatus *,mozilla::EventDispatchingCallback *,nsCOMArray<mozilla::dom::EventTarget> *) dom/events/EventDispatcher.cpp
=============================================================

More reports: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=js%3A%3Ajit%3A%3AEnterBaselineMethod%28JSContext%2A%2C+js%3A%3ARunState%26%29

This is the same signature as a recent B2G topcrasher (bug 978450) but affects Desktop Firefox. This signature has been around on Desktop for a long time but has recently exploded on Beta by an extreme margin starting on 2014-07-02.
https://crash-analysis.mozilla.com/rkaiser/2014-07-03/2014-07-03.firefox.31.explosiveness.html

This is currently #37 across 7-days and #22 across 3-days. While not strictly a "topcrash" yet I'm marking it as such based on explosiveness.
Looking at the product correlation, the volume is really high on the latest Beta compared to the previous Beta, and is really high on the latest Nightly compared to the latest Aurora.

> Firefox 31.0b6: 55.56%
> Firefox 33.0a1: 32.98%
> Firefox 32.0a2: 5.38%
> Firefox 31.0b5: 2.05%

Crashes per Install seems to indicate people are crashing here more than once:

> Firefox 31.0b6: 785 crashes per 622 installs
> Firefox 33.0a1: 466 crashes per 214 installs

Facebook seems to be the top URL in the correlations by far.
Updated•10 years ago
status-firefox30:
--- → unaffected
Comment 1•10 years ago
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #0)
> This bug was filed from the Socorro interface and is
> report bp-23cc14e5-2e4f-4f96-ab95-2cf572140627.
> =============================================================
> 0 mozjs.dll js::jit::EnterBaselineMethod(JSContext *,js::RunState &)
> js/src/jit/BaselineJIT.cpp
> 1 mozjs.dll Interpret js/src/vm/Interpreter.cpp
> 2 mozjs.dll js::RunScript(JSContext *,js::RunState &)
> =============================================================

In general such a stack (EnterBaselineMethod) is useless: we enter generated code, so we do not know what code is being executed when these crashes happen.

> Looking at the product correlation the volume is really high on the latest
> Beta compared to the previous Beta, and is really high on the latest Nightly
> compared to the latest Aurora.
>
> > Firefox 31.0b6: 55.56%
> > Firefox 33.0a1: 32.98%
> > Firefox 32.0a2: 5.38%
> > Firefox 31.0b5: 2.05%

Changelog from Firefox 31.0b5 to Firefox 31.0b6:
http://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=a04918ac3197&tochange=9f7d43269809

Terrence, could that be related to Bug 1028358?

Anthony, can somebody from QA find a way to reproduce this issue?
Flags: needinfo?(terrence)
Flags: needinfo?(anthony.s.hughes)
(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> Anthony, can somebody from QA find a way to reproduce this issue?

There's really nothing useful in any of the reports to help guide testing. Is there anything in the pushlog which stands out that we could test around?
Flags: needinfo?(anthony.s.hughes)
Comment 3•10 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> Changelog from Firefox 31.0b5 to Firefox 31.0b6:
> http://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=a04918ac3197&tochange=9f7d43269809
>
> Terrence, could that be related to Bug 1028358?

I don't think so. That barrier code is not used by the jits, it would only increase the live set anyway, and the crash is a null deref, not a UAF. I don't think GC is likely to be implicated here.

> Anthony, can somebody from QA find a way to reproduce this issue?
Flags: needinfo?(terrence)
Comment 4•10 years ago
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #2)
> (In reply to Nicolas B. Pierron [:nbp] from comment #1)
> > Anthony, can somebody from QA find a way to reproduce this issue?
>
> There's really nothing useful in any of the reports to help guide testing.

No, reports with EnterBaseline are just saying “Hey, we are executing some JavaScript that we have executed more than 10 times before”, which does not help to find the context of the failure.

> Is there anything in the pushlog which stands out that we could test around?

I looked at it, and the only commit which stands out is Bug 1028358, but Terrence replied to this hypothesis in comment 3.

The other option would be that this is something new in Facebook pages (comment 0), which is causing more failures by highlighting an existing bug that might have been in the tree for a while.
Comment 5•10 years ago
Untracking. No activity and too late for 31
Comment 6•9 years ago
I'm marking as won't fix for 32 as there has been no activity. ni Naveed to help get this top crash prioritized.
Updated•9 years ago
Flags: needinfo?(nihsanullah)
Comment 9•9 years ago
(In reply to Sylvestre Ledru [:sylvestre] from comment #8)
> Jan, so, how do the stats look like? Thanks

The fix mentioned in comment 7 helped a bit (and is in 32). But EnterBaselineMethod is still at #4 for 32, #10 for 33.

Unfortunately (top-)crashes in JIT code are not a new thing; we've had them since the first Firefox releases with a JIT. I looked at some of the crash reports recently and most of those were caused by memory corruption that's impossible to track down... It could even be code outside the JS engine that's misbehaving and corrupting our code. I'll keep an eye on crash-stats though.
Flags: needinfo?(jdemooij)
Comment 10•9 years ago
OK. Thanks for the feedback. I guess this is going to be a wontfix for 33.
Comment 11•9 years ago
Wontfix for 33 then.
Comment 12•9 years ago
Given comment 9, is there anything else that we can do in this bug?
Flags: needinfo?(jdemooij)
Comment 13•9 years ago
(In reply to Lawrence Mandel [:lmandel] from comment #12)
> Given comment 9, is there anything else that we can do in this bug?

If there's a new spike or a website that crashes reliably we'd be happy to investigate and fix it, but the current crashes look like random memory corruption and there's not much we can do. This bug is not really actionable, so I don't know if we should track it.
Flags: needinfo?(jdemooij)
Comment 14•9 years ago
Kairo/Anthony - Is this still a topcrash in 33/34/35? If so, is there any more information that you can provide to assist with debugging? If not, this looks like a resolved/incomplete to me.
Flags: needinfo?(kairo)
Flags: needinfo?(anthony.s.hughes)
Reporter
Comment 15•9 years ago
I looked over the stats for this signature and this does not seem to qualify as a topcrash anymore, though it is still affecting some users.

> 33.0*: 90 reports
> 34.0*: 28 reports
> 35.0*: 1 report
> 36.0*: 25 reports

https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=7&range_unit=days&date=2014-10-22&signature=js%3A%3Ajit%3A%3AEnterBaselineMethod%28JSContext*%2C+js%3A%3ARunState%26%29
Comment 16•9 years ago
Given the data in comment 15 and the lack of additional information for debugging, I think this can likely be resolved. I want to wait until at least tomorrow to give Kairo a chance to comment.
Comment 17•9 years ago
Well, if we resolve it, we might need another bug for tracking the ongoing (but probably unactionable) stream of crashes we have all the time with this signature, which in reality is probably all kinds of different things crashing *inside* baseline-compiled code.
Comment 18•9 years ago
I'm going to leave this open so that we have somewhere to track this (per Kairo in comment 17) but am dropping tracking as it is currently not actionable.
tracking-firefox34:
+ → ---
tracking-firefox35:
+ → ---
Comment 19•9 years ago
This signature now affects Developer Edition 39.0a2 2015-03-30 win32 builds under Windows at start-up. The builds are unusable. Win64 builds are not affected under Windows. Linux and Mac builds can be started and used.
status-firefox39:
--- → affected
Comment 20•9 years ago
Naveed, seems like we need your help! Could you help us with that? Thanks (this is critical as we cannot reenable 39 aurora updates).
tracking-firefox39:
--- → +
Flags: needinfo?(nihsanullah)
Comment 21•9 years ago
Ugh, we have the same issue in automation at the moment in bug 1149377. I'm working on bisecting it now, but being pgo-only isn't helping.
See Also: → 1149377
Comment 22•9 years ago
FWIW, this bug clearly pre-dates whatever's going on with Aurora since yesterday's uplift. I think we should track the new problem over in bug 1149377 rather than this one.
Comment 24•9 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #22)
> FWIW, this bug clearly pre-dates whatever's going on with Aurora since
> yesterday's uplift. I think we should track the new problem over in bug
> 1149377 rather than this one.

Yes, the signature in here is pretty much a catch-all for a class of crashes in the Baseline JIT.
Comment 25•9 years ago
nbp and jandem are working on the current issue in bug 1149377.

For the next time we end up here: This stack by itself (and therefore this specific bug) is not really actionable. It may imply a code generation problem or an exception occurred while processing warm JS. Bisection or another hint will probably be needed to work the issue and a more specific bug should be opened.
Flags: needinfo?(nihsanullah)
Comment 26•8 years ago
(In reply to Naveed Ihsanullah [:naveed] from comment #25)
> nbp and jandem are working on the current issue in bug 1149377.
>
> For the next time we end up here: This stack by itself (and therefore this
> specific bug) is not really actionable. It may imply a code generation
> problem or an exception occurred while processing warm JS. Bisection or
> another hint will probably be needed to work the issue and a more specific
> bug should be opened.

Hello Naveed! FWIW I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=1200685

Hope it is useful; otherwise let me know and I'd close it =)

Thanks!
Flags: needinfo?(nihsanullah)
Comment 27•8 years ago
I'll pass the bug on to Jan. I don't see any additional actionable information in that bug but perhaps Jan can tell more.

Jan, can we instrument the code for this class of crashes so more information is available to us in the crash reports?
Flags: needinfo?(nihsanullah)
Updated•8 years ago
Flags: needinfo?(jdemooij)
Updated•8 years ago
Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] → [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)]
[@ js::jit::EnterBaselineMethod]
Comment 29•8 years ago
"Assignee:" taken over from Bug 1200685.
Assignee: nobody → jdemooij
Blocks: shutdownkill
status-firefox41:
--- → ?
status-firefox42:
--- → ?
status-firefox43:
--- → affected
status-firefox44:
--- → ?
status-firefox45:
--- → ?
Whiteboard: ShutDownKill
Comment 31•8 years ago
From Bug 956980 ...

Summary: crash in js::jit::EnterBaselineMethod(JSContext*, js::RunState&) mostly with cached documents
https://bugzilla.mozilla.org/show_bug.cgi?id=956980#c0

(In reply to Kevin Brosnan [:kbrosnan] from comment #0)
> This bug was filed from the Socorro interface and is
> report bp-77126f40-348c-46eb-9f74-79c772140106.
> =============================================================
>
> Nothing useful in comments. Almost all the URLs have wyciwyg which suggests
> the documents were retrieved from the cache. Wired URLs represent 10 out of
> the 13 submitted URLs.
>
> wyciwyg://0/http://www.wired.com/opinion/2013/11/so-the-internets-about-to-lose-its-net-neutrality/
>
> wyciwyg://0/http://www.wired.com/opinion/2012/11/cease-and-desist-manuals-planned-obsolescence/
>
> There are two non-cache URLs and those are
>
> http://www.photoprikol.net/photo/138-igrushki-sssr-72-foto.html
>
> https://www.facebook.com/
Whiteboard: ShutDownKill → [native-crash], ShutDownKill
Comment 32•8 years ago
+ Emails from the dups ...
Comment 33•8 years ago
(In reply to Naveed Ihsanullah [:naveed] from comment #27)
> Ill pass the bug on to Jan. I don't see any additional actionable
> information in that bug but perhaps Jan can tell more.
>
> Jan can we instrument the code for these class of crashes so more
> information is available to us in the crash reports?

Yeah these crashes aren't really actionable. JIT crashes are caused by different bugs and many of the reports are random memory corruption. We want to hear about spikes and reproducible cases though.

Making JIT code non-writable may help us catch memory corruption bugs sooner/elsewhere. That's bug 1215479 but it's pretty hard to do without regressing performance.
Flags: needinfo?(jdemooij)
Comment 34•8 years ago
(In reply to Jan de Mooij [:jandem] from comment #33)
> Making JIT code non-writable may help us catch memory corruption bugs
> sooner/elsewhere. That's bug 1215479 but it's pretty hard to do without
> regressing performance.

Could we only re-protect the code 1/10th of the time? That would amortize the cost of protecting the pages, and potentially catch some of these other issues without huge performance regressions, while providing a better crash stack.
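A minimal sketch of that amortized re-protection idea, assuming a POSIX mprotect-based setup; ExecutableRegion and ReprotectSometimes are hypothetical names for illustration, not existing SpiderMonkey APIs:

```cpp
// Hypothetical sketch of "only re-protect 1/10th of the time" from comment 34.
#include <sys/mman.h>
#include <cstddef>
#include <cstdlib>

struct ExecutableRegion {
  void* base;     // page-aligned start of a JIT code region (assumed)
  size_t length;  // multiple of the page size (assumed)
};

// After patching JIT code, flip the pages back to read+execute only one time
// out of ten, so the syscall cost is amortized while a stray write still has
// some chance of faulting cleanly and producing a useful crash stack.
void ReprotectSometimes(const ExecutableRegion& region) {
  if (std::rand() % 10 != 0) {
    return;  // fast path: leave the pages writable this time
  }
  mprotect(region.base, region.length, PROT_READ | PROT_EXEC);
}
```

The sampling rate is a tunable trade-off: lower rates cost less but catch fewer corrupting writes in the act.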
Comment 35•8 years ago
From the crash signature [@ js::jit::EnterBaselineMethod ], the affected versions are:
- Nightly: 47
- Aurora: 46, 45
- Beta: 45.0b1, 45.0b2, 44.0b99, 44.0b1, 44.0b9, 44.0b8, 44.0b6, 44.0b2, 44.0b7

In the crash signature [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&) ] there are no reports in the last 28 days.
Updated•8 years ago
Blocks: e10s-crashes
Updated•8 years ago
tracking-e10s:
--- → ?
Comment 36•8 years ago
Currently, for the past 7 days, there are 2800 crashes reported for beta and only 12 reported on nightly for [@ js::jit::EnterBaselineMethod]
Comment 37•8 years ago
(In reply to [:tracy] Tracy Walker from comment #36)
> Currently, for the past 7 days, there are 2800 crashes reported for beta and
> only 12 reported on nightly for [@ js::jit::EnterBaselineMethod]

As mentioned all along this bug, this signature is not actionable. To investigate such issues, here are some of the fastest ways forward:
- Reproduce the issue with one of the reported URLs.
- List all backported patches since the last version (comment 25).
- Find an actionable existing bug which highlights the same crash characteristics (crash address, stack pointer, …).

Without any of this information, I would not expect any investigation from the JS Team, as we are likely to harm our users more with random urgent fixes.
Comment 38•8 years ago
This is a generic crash that doesn't appear to afflict e10s more or less than non-e10s. Untracking.

45.0b6 content process crashes - 228
45.0b6 crashes with e10s disabled - 2165

The percentage of beta users running e10s during our experiment was about 10%.
Updated•8 years ago
No longer blocks: e10s-crashes
Comment 39•8 years ago
Looking at beta 46 (5, 6, 7) experiment crash data, this shows up twice as often under e10s. It is also the #8 top crasher.
Blocks: e10s-crashes
Comment 40•8 years ago
Jan, any suggestions here on how to proceed with this under e10s? https://crash-stats.mozilla.com/search/?product=Firefox&version=46.0b7&version=46.0b6&version=46.0b5&dom_ipc_enabled=!__null__&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Flags: needinfo?(jdemooij)
Comment 41•8 years ago
(In reply to Jim Mathies [:jimm] from comment #40)
> Jan, any suggestions here on how to proceed with this under e10s?

I looked at some of these beta crash dumps and it's the usual mix. Most common are:

* Valid JIT code but some invalid bytes in the middle. This is pretty weird and code corruption should be less likely now with W^X (also in 46). I'm currently tracking down a related crash in bug 1124397. That's probably some other thread misbehaving. I'll continue working on that one.

* Valid JIT code but reading/writing invalid memory. JIT code accesses a lot of things and this is probably similar to the GC topcrashes we have.

* Some crashes remind me of bug 1260721. I'll see what we can do there.

Unfortunately most of these look like random memory corruption. If these crashes are worse with e10s, maybe we have some heap corruption bugs there?
Flags: needinfo?(jdemooij)
Comment 42•8 years ago
Still a current problem. Firefox crashes in less than 2 minutes after opening. Open a new tab, open a browser; many times, when the page attempts to render, it crashes. About 5 different crash reasons. Says it is not a plugin crash.
#2 topcrash on 46 release right now (pretty high volume, just under OOM crashes). e10s should be disabled on release. People are complaining that they are hitting the crash after updating. The crash spike may also be correlated with AV software (see bug 1268025)
Flags: needinfo?(jdemooij)
Comment 44•8 years ago
Today I looked at about 80 crash dumps for EnterBaselineMethod crashes (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them. Here are the largest buckets:

-----

(1) At least 15-20% of these crashes are with our notorious "AuthenticAMD family 20 model 2 stepping 0" CPU. These crashes are all similar: we're executing the following Baseline type monitor stub:

  cmp $0xffffff88,%ecx
  jne L
  cmp %edx,0x10(%edi)
  jne L
  ret
  L:
  mov 0x4(%edi),%edi
  jmp *(%edi)

The first instruction is the one where we crash (EXCEPTION_ACCESS_VIOLATION_READ or EXCEPTION_ACCESS_VIOLATION_WRITE with a low address like 0x168). Yes, that makes no sense: this compare instruction does not access any memory.

I don't see crashes in this code with any other CPU. It's not the first time this processor is causing trouble, see bug 772330 and also bug 1264188 (although the latter is mostly model 1 and this is model 2). I wonder if this could be erratum 688 or a similar bug - Baseline stubs definitely use a lot of indirect jumps and calls.

Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501

Not sure what we should do here - we could try to emit some NOPS between the jumps and see if that helps...

-----

(2) At least 8% (7 reports) are caused by a single bit flip in ICEntry pointers in Baseline code. Baseline code calls into ICs for most bytecode ops, so a typical Baseline script has sequences of:

  mov $0x6675cbcc,%edi   <- ICEntry 1
  mov (%edi),%edi
  call *(%edi)
  ..
  mov $0x6675cbd8,%edi   <- ICEntry 2
  mov (%edi),%edi
  call *(%edi)
  ..
  mov $0x6675cae4,%edi   <- ICEntry 3
  mov (%edi),%edi
  call *(%edi)           <== crash

Notice that there are 12 bytes (that's sizeof(ICEntry) on x86) between ICEntry 1 and ICEntry 2. ICEntry 3 is bogus: it should be 0x6675cbe4 but it is 0x6675cae4 -- 1 bit was flipped.

These bit flips in ICEntry pointers are surprisingly common. We should probably add checks for this. Not sure what else we can do. (This particular crash is bp-4a6a05ac-f0b7-4f75-b41f-50fbf2160501.)

-----

(3) At least 15% (13 reports) are bit flips in JIT code (either instructions or labels), for instance:

- Exhibit 1: bp-2639a76f-172c-47d2-81b4-a01162160501

  cmp $0x1000000,%ebx
  jb 0x11a7f4e2
  cmp $0xffffff88,%ecx
  jne 0x11a7f4d2

This is part of a post barrier in JIT code. The second jump offset should be the same as the first jump, but a bit was flipped so instead it jumps in the middle of an instruction. (At 0x11a7f4d2 we have a 0xfb byte, that's an STI instruction that's invalid in user mode, so we crash with EXCEPTION_PRIV_INSTRUCTION.)

- Exhibit 2: bp-6a2f6ba3-7ac7-4737-a4c9-d21542160503

  1e91016: bf e0 20 85 0a   mov $0xa8520e0,%edi
  1e9101b: 8b 3f            mov (%edi),%edi
  1e9101d: ff 17            call *(%edi)
  1e9101f: bf ec 20 85 0a   mov $0xa8520ec,%edi
  1e91024: 8b 3f            mov (%edi),%edi
  1e91026: ff 1f            lcall *(%edi)

The last instruction is where we crash: a bitflip (0x17 -> 0x1f) corrupted a call instruction ("lcall" makes no sense). There are many similar bitflips.

-----

(4) At least 14% (12 reports) are EXCEPTION_ACCESS_VIOLATION_EXEC while trying to execute memory that doesn't look like JIT code. Likely random pages that we attempt to execute because of a bug somewhere. Many of these are probably caused by bit flips (the previous 2 categories) and we happened to end up in mapped memory instead of crashing immediately.

-----

These 4 buckets cover about 50% or so. The remaining crashes are harder to categorize, but I think a good chunk of them are caused by similar memory corruption.
I did see some crashes where we have for instance a Value with object type tag and nullptr payload, but because there are so few of them it's not clear what's going on.
Flags: needinfo?(jdemooij)
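A small sketch of the kind of single-bit-flip check suggested in comment 44, assuming we know the value a pointer should have held (for example the expected ICEntry address); the helper name is made up for illustration and is not SpiderMonkey code:

```cpp
// Does the observed pointer differ from the expected one by exactly one bit?
// E.g. 0x6675cbe4 vs 0x6675cae4 from comment 44 differ only in bit 8.
#include <bitset>
#include <cstdint>

bool LooksLikeSingleBitFlip(uintptr_t expected, uintptr_t observed) {
  // XOR leaves only the differing bits set; a popcount of 1 means a lone flip.
  return std::bitset<sizeof(uintptr_t) * 8>(expected ^ observed).count() == 1;
}
```

A check like this could classify dumps automatically instead of eyeballing each ICEntry sequence by hand.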
Comment 45•8 years ago
(In reply to Jan de Mooij [:jandem] from comment #44)
> Today I looked at about 80 crash dumps for EnterBaselineMethod crashes
> (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.
>
> Here are the largest buckets:
>
> -----
>
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
>
> cmp $0xffffff88,%ecx
> jne L
> cmp %edx,0x10(%edi)
> jne L
> ret
> L:
> mov 0x4(%edi),%edi
> jmp *(%edi)
>
> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.
>
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.
>
> Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501
>
> Not sure what we should do here - we could try to emit some NOPS between the
> jumps and see if that helps...

This sounds kind of similar to bug 772330 comment 22, where dmajor describes an AMD CPU errata. The errata is in this doc:
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The description says:

"Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the branch status when a taken branch occurs where the first or second instruction after the branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that it appears the processor skips, and does not execute, one or more instructions. The new updated rIP due to this erratum may not be at an instruction boundary"

It's not a great matchup, but if these sort of crashes are *all* on AMD chips, the above might be plausible...
Comment 46•8 years ago
...and if I had read more closely, I would have seen that you referenced that exact bug and errata. =/
Comment 47•8 years ago
Random idea: What if we had a system that allocated a few scattered MiB (i.e., not all in one contiguous run or always at the same address, though being careful not to unduly increase fragmentation) with a predictable bit-pattern and periodically (say, on the daily telemetry or update ping) the system scanned all those MiBs to ensure they still had the same bit-pattern. If a bitflip was detected, we set a flag in the browser that gets included in crash reports and also persists between browser restarts (at least for a period of time). This could help us confirm a correlation between these catch-all JIT/GC crashes and the corruption flag and also have separate bins so that spikes in non-corruption-correlated crashes get more attention. If we wanted to get fancy, we could even pop up a notification to the user suggesting they have bad RAM if they had the corruption flag and they were experiencing crashes :)
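A rough sketch of the canary idea above, assuming a handful of pattern-filled buffers rescanned from some periodic task (for example the daily telemetry ping); CanaryPool is a hypothetical name, not an existing Firefox mechanism:

```cpp
// Fill a few scattered 1 MiB buffers with a known pattern and rescan them
// periodically; any byte that no longer matches suggests RAM corruption.
#include <cstddef>
#include <cstdint>
#include <vector>

class CanaryPool {
  static constexpr size_t kBufferSize = 1 << 20;  // 1 MiB per buffer
  static constexpr uint8_t kPattern = 0xA5;       // predictable bit pattern
  std::vector<std::vector<uint8_t>> buffers_;
  bool corruptionSeen_ = false;

 public:
  explicit CanaryPool(size_t bufferCount) {
    for (size_t i = 0; i < bufferCount; i++) {
      buffers_.emplace_back(kBufferSize, kPattern);  // separate allocations
    }
  }

  // Returns true if any canary byte changed; the flag would then be annotated
  // on crash reports (and could persist across restarts, as suggested above).
  bool Scan() {
    for (const auto& buf : buffers_) {
      for (uint8_t byte : buf) {
        if (byte != kPattern) {
          corruptionSeen_ = true;
        }
      }
    }
    return corruptionSeen_;
  }
};
```

The cost is a few MiB of resident memory plus an occasional linear scan, which is cheap next to the signal it would give for binning these crashes.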
Comment 48•8 years ago
This isn't a shutdown crash as far as I can tell.
No longer blocks: shutdownkill
Whiteboard: [native-crash], ShutDownKill → [native-crash]
Comment 49•8 years ago
(In reply to Jan de Mooij [:jandem] from comment #44)
> Today I looked at about 80 crash dumps for EnterBaselineMethod crashes
> (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.

Awesome work!

> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.

Can we detect this CPU family and use the segfault handler to resume the execution, in a similar way to how operating systems emulate old instructions on newer generations of CPUs?

> […] are caused by a single bit flip […]

Luke's suggestion sounds interesting. I recall people mentioning doing a memcheck as part of safe mode. I do not know what cost this would have, but maybe this is something we can (randomly) do when we allocate new memory pages.
Comment 50•8 years ago
(In reply to Jan de Mooij [:jandem] from comment #44)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
...
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.

My experience with the AMD crashes has been that some of them show up mostly on model 1, and others show up mostly on model 2.

> Not sure what we should do here - we could try to emit some NOPS between the
> jumps and see if that helps...

That could actually be very interesting, especially if you understand what the alignment conditions described in the erratum are referring to. (alignment of what?) It seems like it might be entirely possible to make these crashes go away for JIT-generated code by changing the alignment of jumps.

> These bit flips in ICEntry pointers are surprisingly common. We should
> probably add checks for this. Not sure what else we can do.

Yeah, I've seen a bunch of other crashes recently that were the result of bit flips in memory. Though I'm curious if it's consistent with random memory being bad for such a high proportion of such crashes to end up at this particular spot. Are there a large number of these pointers?

(For the bit flips being in JIT code... it seems more obvious to me that that's a big use of memory.)
Comment 51•8 years ago
Thank you for the detailed analysis, Jan.

The bitflips are scary. Jan, are you assuming that it's faulty hardware that's the cause? It sounds like others are assuming that but I can't tell if that's what you think.

I think running a memtest under certain circumstances is a great idea. How hard is it to write a memtest? What circumstances would you run it under? Do we have a bug open for this idea?
Comment 52•8 years ago
I remember dolske was experimenting with running a memtest a few years ago; I don't know what happened there. I don't have any bugs on file -- just an idea while reading Jan's very interesting analysis -- sorry, don't mean to derail to more targeted discussion here.
Comment 53•8 years ago
The mysterious category 1 could also be bit flips. This is the machine code for the troublesome fragment:

  0: 83 f9 88   cmp $0xffffff88,%ecx
  3: 75 06      jne b <L>
  5: 39 57 10   cmp %edx,0x10(%edi)
  8: 75 01      jne b <L>
  a: c3         ret
  <L>

I generated all possible one-bit flips of the first instruction. One possibility stands out: if the first byte becomes A3, then the CPU sees

  0: a3 f9 88 75 06   mov %eax,0x67588f9
  5: 39 57 10         cmp %edx,0x10(%edi)
  8: 75 01            jne b <L>
  a: c3               ret

which performs a write to memory at an address that's almost certainly inaccessible. It's not a _low_ address, though.

It takes three bit flips to hit an instruction that could plausibly write to a low address:

  0: 89 79 88   mov %edi,-0x78(%ecx)
  3: 75 06      jne b <L>
  5: 39 57 10   cmp %edx,0x10(%edi)
  8: 75 01      jne b <L>
  a: c3         ret

Still, what with all the other cases seeming to be memory corruption, I would suggest that this is more probable than a CPU bug.
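For reference, the enumeration described above ("all possible one-bit flips of the first instruction") can be generated mechanically and then fed to a disassembler by hand or script; this is illustrative tooling only, assuming the three instruction bytes quoted above:

```cpp
// Print every single-bit-flip variant of the bytes 83 f9 88
// (cmp $0xffffff88,%ecx). The "a3 f9 88" case discussed above is one of them.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const std::vector<uint8_t> insn = {0x83, 0xf9, 0x88};
  for (size_t byteIdx = 0; byteIdx < insn.size(); byteIdx++) {
    for (int bit = 0; bit < 8; bit++) {
      std::vector<uint8_t> flipped = insn;
      flipped[byteIdx] ^= uint8_t(1) << bit;  // flip exactly one bit
      for (uint8_t b : flipped) {
        std::printf("%02x ", b);
      }
      std::printf("\n");
    }
  }
  return 0;
}
```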
Comment 54•8 years ago
Another one-bit flip possibility that I missed earlier:

  0: 83 b9 88 75 06 39 57   cmpl $0x57,0x39067588(%ecx)
  7: 10 75 01               adc %dh,0x1(%ebp)
  a: c3                     ret

That could hit a low address depending on what's in %ecx. (What _is_ in %ecx?)
Comment 55•8 years ago
(In reply to Nicholas Nethercote [:njn] from comment #51)
> Thank you for the detailed analysis, Jan.
>
> The bitflips are scary. Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.
>
> I think running a memtest on certain circumstances is a great idea. How hard
> is it write a memtest? What circumstances would you run it under? Do we have
> a bug open for this idea?

Bug 995652 is our memtest on crash bug. I've filed bug 1270554 to work on memtest in the running firefox process.
Comment 56•8 years ago
Thanks for all comments. Replies below..

(In reply to Nicolas B. Pierron [:nbp] from comment #49)
> Can we detect this cpu familly, and use the segfault handler to resume the
> execution? In a similar way as operating system are emulating old
> instructions on newer generations of cpu.

Interesting idea but it seems complicated, also because we don't really know the state of the CPU when it misbehaves.

(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #50)
> > Not sure what we should do here - we could try to emit some NOPS between the
> > jumps and see if that helps...
>
> That could actually be very interesting, especially if you understand what
> the alignment conditions described in the erratum are referring to.
> (alignment of what?) It seems like it might be entirely possible to make
> these crashes go away for JIT-generated code by changing the alignment of
> jumps.

Yeah I think as a first step we could try to emit NOPS as part of this particular IC stub, and see if it makes these crashes go away.

> Though I'm curious if it's consistent with random
> memory being bad for such a high proportion of such crashes to end up at
> this particular spot. Are there a large number of these pointers?

Yes, basically one for each interesting JS bytecode op. Also, we first emit the code and then at the end we write these pointers in it (once we know the values), so it's possible that write pattern happens to hit memory or cache lines in a way that makes it more error prone.

(In reply to Nicholas Nethercote [:njn] from comment #51)
> Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.

I think so, yeah. In theory it could be another thread doing something like *bytePtr ^= 0x1, but that also seems unlikely. Our JIT code is usually non-writable so the window for this is pretty small. Also, on Twitter people from the Chrome/V8 teams said they've seen similar bitflips.

(In reply to Zack Weinberg (:zwol) from comment #53)
> The mysterious category 1 could also be bit flips.

That's not what I'm seeing in the memory dumps. Or do you mean a different kind of bitflip, somewhere in the CPU?

> Still, what with all the other cases seeming to be memory corruption, I
> would suggest that this is more probable than a CPU bug.

Also if it's (a) *only* this exact CPU, and (b) *always* this particular piece of JIT code and (c) this CPU is *known* to be buggy when it comes to (indirect) branches?
Comment 57•8 years ago
It might be easy and interesting to compute a checksum of each block of machine code, and then check it before entry. Checksums can be pretty fast.
Comment 58•8 years ago
... *especially* checksums that need to detect only a single bit changing, without correction. We could do giant SSE xors, 128 bits at a time, over the code. It'd be on the scale of a memset or memcpy operation.
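A rough sketch of that XOR-fold idea, assuming SSE2 is available; XorChecksum is a hypothetical helper, not an existing jit API. A single XOR fold is enough to detect any lone flipped bit in the block:

```cpp
// XOR-fold a block of machine code 16 bytes at a time, reducing to 64 bits.
// The value would be recorded when the code is emitted and re-checked before
// entering the block; a mismatch means the code changed underneath us.
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

uint64_t XorChecksum(const uint8_t* code, size_t length) {
  __m128i acc = _mm_setzero_si128();
  size_t i = 0;
  for (; i + 16 <= length; i += 16) {
    acc = _mm_xor_si128(
        acc, _mm_loadu_si128(reinterpret_cast<const __m128i*>(code + i)));
  }
  // Fold the 128-bit accumulator, then mix in any trailing bytes.
  uint64_t lanes[2];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
  uint64_t sum = lanes[0] ^ lanes[1];
  for (; i < length; i++) {
    sum ^= uint64_t(code[i]) << (8 * (i % 8));
  }
  return sum;
}
```

As noted above, the per-check cost is roughly that of streaming over the code once, comparable to a memcpy of the same size.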
Comment 59•8 years ago
(In reply to Jan de Mooij [:jandem] from comment #44)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
>
> cmp $0xffffff88,%ecx
> jne L
> cmp %edx,0x10(%edi)
> jne L
> ret
> L:
> mov 0x4(%edi),%edi
> jmp *(%edi)
>
> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.

FWIW, this hasn't been a characteristic of the other crashes we've seen with bug 772330. I believe for those, it's made sense how we would have crashed at the given instruction given that we ended up there in the state we were in.

> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.
>
> Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501

I looked at this one a little bit, and I really don't see how we crashed. The JIT code was:

  06EF5C50 83 F9 88            cmp ecx,0FFFFFF88h
  06EF5C53 0F 85 0A 00 00 00   jne 06EF5C63
  06EF5C59 39 57 10            cmp dword ptr [edi+10h],edx
  06EF5C5C 0F 85 01 00 00 00   jne 06EF5C63
  06EF5C62 C3                  ret

I wonder if there's a way to transform that into something that reads from EBX with a bit flip. (I mention EBX because the crash address is 0x80, which is the value of EBX.) (The closest I see is 8B 3B, which is four bit flips!)
Comment 60•8 years ago
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) (busy May 9-13) from comment #59)
> I wonder if there's a way to transform that into something that reads from
> EBX with a bit flip. (I mention EBX because the crash address is 0x80,
> which is the value of EBX.)

I looked at some other reports and the crash address is often the value in either EAX or EBX.

Another erratum that might be relevant here:

> 578. Branch Prediction May Cause Incorrect Processor Behavior
>
> Under a highly specific and detailed set of internal timing conditions involving
> multiple events occurring within a small window of time, the processor branch
> prediction logic may cause the processor core to decode incorrect instruction
> bytes.
>
> Potential Effect on System
>
> Unpredictable program behavior, generally leading to a program exception.

I think "decoding incorrect instruction bytes" fits these crashes really well. This issue has been fixed; it's possible to check CPUID bits to see if the processor has the fix. Unfortunately there's not much information to go on, so it's just guessing at this point.
For [@ js::jit::EnterBaselineMethod ], it's the #10 topcrash on release for 46.0.1. It looks fairly high volume on 47 beta 4 and beta as well. It doesn't show up much in 49 or 48. Is there something new going on? Does this seem actionable at all? Till we know, I'll track this for 46 and 47.
status-firefox46:
--- → affected
status-firefox47:
--- → affected
status-firefox48:
--- → affected
tracking-firefox46:
--- → +
tracking-firefox47:
--- → +
Flags: needinfo?(jdemooij)
Comment 62•8 years ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #61)
> For [@ js::jit::EnterBaselineMethod ], it's the #10 topcrash on release for
> 46.0.1. It looks fairly high volume on 47 beta 4 and beta as well. It
> doesn't show up much in 49 or 48. Is there something new going on? Does this
> seem actionable at all?
>
> Till we know, I'll track this for 46 and 47.

I looked at this a bit and I don't think the beta crashes are very different from release. There's at least 1 user (on XP SP2) who submitted a pretty large number of beta crash reports. They don't look very actionable though, maybe malware or bad hardware.
Flags: needinfo?(jdemooij)
Comment 63•8 years ago
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU.
> [...]
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.

I did some follow-up analysis on this. TL;DR: The following CPU families have suspiciously high EnterBaselineMethod crash rates. They are ranked from the most crashes to the least.

> Cpu Info Count
> AuthenticAMD family 16 model 6 stepping 3 | 2 18341
> AuthenticAMD family 20 model 2 stepping 0 | 2 13663
> AuthenticAMD family 22 model 0 stepping 1 | 2 6471
> AuthenticAMD family 21 model 19 stepping 1 | 2 5894
> AuthenticAMD family 16 model 6 stepping 3 | 1 233
> AuthenticAMD family 20 model 1 stepping 0 | 2 143
> AuthenticAMD family 6 model 8 stepping 1 | 1 102
> AuthenticAMD family 22 model 0 stepping 1 | 4 78

Note especially the many crashes outside of "family 20"! Perhaps AMD bugs are more widespread than we thought? Jan, it might be worth looking at JIT crashes in these other families.

----

Details:

I did a super search for all Firefox crashes in the past 7 days, faceted on the "cpu info" field:
https://crash-stats.mozilla.com/search/?product=Firefox&_facets=signature&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

I clicked on all 50 entries, opening a new tab for each one. Each tab thus held all the Firefox crashes in the past 7 days for a single CPU family. I then went through them all and manually extracted the rank and percentage for EnterBaselineMethod crashes, resulting in the following table.

> Rank Cpu info Count % EnterBaselineMethod rank/%
> - ALL FAMILIES COMBINED #14 0.73 %
> 1 GenuineIntel family 6 model 23 stepping 10 | 2 137092 9.91 % #10 0.92 %
> 2 GenuineIntel family 6 model 42 stepping 7 | 4 101213 7.32 % #32 0.40 %
> 3 GenuineIntel family 6 model 58 stepping 9 | 4 98764 7.14 % #42 0.30 %
> 4 GenuineIntel family 6 model 60 stepping 3 | 4 62630 4.53 % (not in top 50)
> 5 * GenuineIntel family 6 model 15 stepping 13 | 2 59451 4.30 % #10 1.24 %
> 6 GenuineIntel family 6 model 42 stepping 7 | 2 47232 3.41 % #24 0.47 %
> 7 GenuineIntel family 6 model 69 stepping 1 | 4 44410 3.21 % (not in top 50)
> 8 GenuineIntel family 6 model 37 stepping 5 | 4 39198 2.83 % #28 0.41 %
> 9 GenuineIntel family 6 model 58 stepping 9 | 2 34429 2.49 % #24 0.46 %
> 10 ??? 27544 1.99 % ???
> 11 GenuineIntel family 6 model 61 stepping 4 | 4 25239 1.82 % (not in top 50)
> 12 family 6 model 69 stepping 1 | 4 24855 1.80 % #27 0.15 %
> 13 GenuineIntel family 6 model 60 stepping 3 | 8 23132 1.67 % (not in top 50)
> 14 GenuineIntel family 6 model 23 stepping 6 | 2 20895 1.51 % #15 0.75 %
> 15 *** AuthenticAMD family 16 model 6 stepping 3 | 2 18341 1.33 % #3 3.79 %
> 16 family 6 model 58 stepping 9 | 4 17604 1.27 % #16 0.66 %
> 17 family 6 model 42 stepping 7 | 4 17084 1.23 % #38 0.36 %
> 18 GenuineIntel family 6 model 37 stepping 2 | 4 16006 1.16 % #28 0.46 %
> 19 GenuineIntel family 6 model 58 stepping 9 | 8 15807 1.14 % #41 0.33 %
> 20 GenuineIntel family 6 model 60 stepping 3 | 2 14002 1.01 % #30 0.36 %
> 21 *** AuthenticAMD family 20 model 2 stepping 0 | 2 13663 0.99 % #1 4.50 %
> 22 * GenuineIntel family 6 model 15 stepping 11 | 2 13360 0.97 % #9 1.16 %
> 23 family 6 model 23 stepping 10 | 2 13193 0.95 % #17 0.75 %
> 24 * AuthenticAMD family 16 model 6 stepping 2 | 2 12956 0.94 % #4 1.52 %
> 25 GenuineIntel family 6 model 42 stepping 7 | 8 12083 0.87 % #21 0.53 %
> 26 GenuineIntel family 6 model 15 stepping 2 | 2 10930 0.79 % #13 0.91 %
> 27 * GenuineIntel family 15 model 6 stepping 5 | 2 9988 0.72 % #7 1.36 %
> 28 GenuineIntel family 6 model 15 stepping 6 | 2 9380 0.68 % #13 0.81 %
> 29 GenuineIntel family 6 model 55 stepping 8 | 2 9241 0.67 % (not in top 50)
> 30 GenuineIntel family 6 model 55 stepping 8 | 4 8767 0.63 % (not in top 50)
> 31 family 6 model 58 stepping 9 | 8 8690 0.63 % #28 0.41 %
> 32 * GenuineIntel family 15 model 4 stepping 3 | 2 8179 0.59 % #11 1.19 %
> 33 * AuthenticAMD family 15 model 107 stepping 2 | 2 8130 0.59 % #11 1.02 %
> 34 * GenuineIntel family 6 model 22 stepping 1 | 1 7811 0.56 % #9 1.13 %
> 35 GenuineIntel family 6 model 37 stepping 5 | 2 7714 0.56 % (not in top 50)
> 36 GenuineIntel family 6 model 23 stepping 10 | 4 7568 0.55 % #12 0.78 %
> 37 * GenuineIntel family 15 model 4 stepping 1 | 2 6837 0.49 % #8 1.23 %
> 38 * GenuineIntel family 15 model 2 stepping 9 | 1 6692 0.48 % #8 1.39 %
> 39 *** AuthenticAMD family 22 model 0 stepping 1 | 2 6471 0.47 % #5 4.42 %
> 40 family 6 model 70 stepping 1 | 8 6385 0.46 % #39 0.33 %
> 41 * GenuineIntel family 15 model 4 stepping 9 | 2 6176 0.45 % #86 1.39 %
> 42 family 6 model 37 stepping 5 | 4 6064 0.44 % #27 0.46 %
> 43 * GenuineIntel family 15 model 4 stepping 1 | 1 5970 0.43 % #7 1.57 %
> 44 *** AuthenticAMD family 21 model 19 stepping 1 | 2 5894 0.43 % #3 3.89 %
> 45 GenuineIntel family 6 model 28 stepping 10 | 2 5522 0.40 % #23 0.42 %
> 46 * AuthenticAMD family 18 model 1 stepping 0 | 2 5491 0.40 % #5 1.60 %
> 47 GenuineIntel family 6 model 78 stepping 3 | 4 5489 0.40 % (not in top 50)
> 48 family 6 model 42 stepping 7 | 8 5387 0.39 % #27 0.48 %
> 49 GenuineIntel family 6 model 45 stepping 7 | 4 5368 0.39 % (not in top 50)
> 50 * AuthenticAMD family 21 model 16 stepping 1 | 2 5353 0.39 % #9 1.20 %

Over all CPU families, EnterBaselineMethod crashes were 0.73% of all crashes. Looking at individual CPU families, four of them stood out as having EnterBaselineMethod crash rates in the range 3.79--4.50%. These are marked with '***'. I also marked ones with an EnterBaselineMethod crash rate greater than 1% with '*', but those could just be natural variation.
I then searched for all EnterBaselineMethod crashes in Firefox in the past 7 days, faceted by Cpu Info:
https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Djs%3A%3Ajit%3A%3AEnterBaselineMethod&_facets=signature&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

I then cross-correlated these ranks with the ranks in the table above, giving this table:

> rank in table 1
> 1 GenuineIntel family 6 model 23 stepping 10 | 2 1262 12.56 % #1
> 2 GenuineIntel family 6 model 15 stepping 13 | 2 738 7.34 % #5
> 3 AuthenticAMD family 16 model 6 stepping 3 | 2 695 6.92 % #15 ***
> 4 AuthenticAMD family 20 model 2 stepping 0 | 2 619 6.16 % #21 ***
> 5 GenuineIntel family 6 model 42 stepping 7 | 4 404 4.02 % #2
> 6 GenuineIntel family 6 model 58 stepping 9 | 4 292 2.91 % #3
> 7 AuthenticAMD family 22 model 0 stepping 1 | 2 285 2.84 % #39 ***
> 8 AuthenticAMD family 16 model 6 stepping 3 | 1 233 2.32 % N/A ***
> 9 AuthenticAMD family 21 model 19 stepping 1 | 2 228 2.27 % #44 ***
> 10 GenuineIntel family 6 model 42 stepping 7 | 2 220 2.19 % #6
> 11 AuthenticAMD family 16 model 6 stepping 2 | 2 196 1.95 % #24 *
> 12 GenuineIntel family 6 model 37 stepping 5 | 4 162 1.61 % #8
> 13 GenuineIntel family 6 model 58 stepping 9 | 2 159 1.58 % #9
> 14 GenuineIntel family 6 model 23 stepping 6 | 2 157 1.56 % #14
> 15 GenuineIntel family 6 model 15 stepping 11 | 2 155 1.54 % #22
> 16 AuthenticAMD family 20 model 1 stepping 0 | 2 143 1.42 % N/A ***
> 17 GenuineIntel family 15 model 6 stepping 5 | 2 135 1.34 % #27
> 18 GenuineIntel family 6 model 60 stepping 3 | 4 118 1.17 % #4
> 19 AuthenticAMD family 6 model 8 stepping 1 | 1 103 1.02 % N/A ***
> 20 GenuineIntel family 6 model 15 stepping 2 | 2 99 0.99 % #26
> 21 GenuineIntel family 15 model 4 stepping 3 | 2 95 0.95 % #32 *
> 22 GenuineIntel family 15 model 4 stepping 1 | 1 94 0.94 % #43 *
> 23 GenuineIntel family 15 model 2 stepping 9 | 1 93 0.93 % #38 *
> 24 AuthenticAMD family 18 model 1 stepping 0 | 2 90 0.90 % #46 *
> 25 GenuineIntel family 6 model 22 stepping 1 | 1 89 0.89 % #34 *
> 26 GenuineIntel family 15 model 4 stepping 9 | 2 85 0.85 % #41 *
> 27 GenuineIntel family 15 model 4 stepping 1 | 2 84 0.84 % #37 *
> 28 AuthenticAMD family 15 model 107 stepping 2 | 2 83 0.83 % #33
> 29 AuthenticAMD family 22 model 0 stepping 1 | 4 78 0.78 % N/A **
> 30 GenuineIntel family 6 model 15 stepping 6 | 2 76 0.76 % #28
> 31 AuthenticAMD family 6 model 10 stepping 0 | 1 75 0.75 % N/A **
> 32 GenuineIntel family 6 model 37 stepping 2 | 4 72 0.72 % #18
> 33 GenuineIntel family 6 model 69 stepping 1 | 4 71 0.71 % #7
> 34 GenuineIntel family 15 model 6 stepping 5 | 1 67 0.67 % N/A *
> 35 AuthenticAMD family 21 model 16 stepping 1 | 2 64 0.64 % #50 *
> 36 GenuineIntel family 6 model 42 stepping 7 | 8 64 0.64 % #25
> 37 AuthenticAMD family 16 model 5 stepping 3 | 3 58 0.58 % N/A *
> 38 AuthenticAMD family 21 model 16 stepping 1 | 4 58 0.58 % N/A *
> 39 GenuineIntel family 6 model 23 stepping 10 | 4 57 0.57 % #36
> 40 AuthenticAMD family 16 model 6 stepping 2 | 1 55 0.55 % N/A *
> 41 AuthenticAMD family 21 model 48 stepping 1 | 4 55 0.55 % N/A *
> 42 GenuineIntel family 15 model 4 stepping 9 | 1 55 0.55 % N/A
> 43 GenuineIntel family 6 model 58 stepping 9 | 8 53 0.53 % #19
> 44 GenuineIntel family 6 model 60 stepping 3 | 2 50 0.50 % #20
> 45 AuthenticAMD family 18 model 1 stepping 0 | 4 49 0.49 % N/A
> 46 GenuineIntel family 6 model 60 stepping 3 | 8 47 0.47 % #13
> 47 AuthenticAMD family 16 model 4 stepping 3 | 4 46 0.46 % N/A
> 48 AuthenticAMD family 15 model 75 stepping 2 | 2 43 0.43 % N/A
> 49 AuthenticAMD family 16 model 5 stepping 3 | 4 43 0.43 % N/A
> 50 GenuineIntel family 15 model 2 stepping 7 | 1 41 0.41 % N/A

The relative position of a CPU family in the two tables indicates its crash rate. For example, the first entry is unsurprising -- "GenuineIntel family 6 model 23 stepping 10" is the #1 family with an EnterBaselineMethod, but it's also the #1 CPU overall. But entries #3 and #4 in this table had much lower rankings in the first table, which suggests they have unusually high EnterBaselineMethod crash rates, and indeed they were two of the previously-identified suspicious ones.

There are also some entries that show up reasonably high in this table, but didn't show up at all in the previous table. So I looked them up and this gave us a few more entries that could be added to the first table:

> ?? AuthenticAMD family 16 model 6 stepping 3 | 1 233 ?.?? % #? 5.34 %
> ?? AuthenticAMD family 20 model 1 stepping 0 | 2 143 ?.?? % #? 3.63 %
> ?? AuthenticAMD family 6 model 8 stepping 1 | 1 102 ?.?? % #? 2.66 %
> ?? AuthenticAMD family 22 model 0 stepping 1 | 4 78 ?.?? % #? 1.68 %

The first entry here, despite having a low number of crashes -- it must just be an uncommon CPU family -- had an even higher EnterBaselineMethod crash rate of 5.34%.

This analysis isn't perfect because other crash signatures may also have correlations against CPU family. Ideally we'd match the EnterBaselineMethod crash rates for each CPU family against the CPU family usage among our user population, perhaps from telemetry data.
Updated•8 years ago
Comment 64•8 years ago
Crash volume for signature 'js::jit::EnterBaselineMethod':
- nightly (version 50): 3 crashes from 2016-06-06.
- aurora (version 49): 6 crashes from 2016-06-07.
- esr (version 45): 1324 crashes from 2016-04-07.

Crash volume on the last weeks:
            Week N-1  Week N-2  Week N-3  Week N-4  Week N-5  Week N-6  Week N-7
- nightly          0         2         0         0         0         0         1
- aurora           3         1         0         0         1         1         0
- esr            197       161       157       142       176       155        89

Affected platforms: Windows, Mac OS X, Linux
status-firefox49:
--- → affected
status-firefox50:
--- → affected
status-firefox-esr45:
--- → affected
Comment 65•8 years ago
Hmm, if this is the AMD bug, that means we had an esr version being impacted...
Comment 66•8 years ago
Based on comment 44, I would expect us to still have a non-zero baseline of crashes, especially on old hardware. I guess the likelihood of using a release / esr version might be higher on old hardware. Bug 1281759 only landed in Gecko 50, so this should not have changed aurora.

Could this be a problem with the crash reporter when we have no stack frame at the top? Or maybe we discard these reports? Or are they classified under a bunch of different signatures?
Comment 67•7 years ago
Crash volume for signature 'js::jit::EnterBaselineMethod':
- nightly (version 51): 1 crash from 2016-08-01.
- aurora (version 50): 1 crash from 2016-08-01.
- beta (version 49): 77 crashes from 2016-08-02.
- release (version 48): 6134 crashes from 2016-07-25.
- esr (version 45): 1674 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
- nightly        0       0       0
- aurora         1       0       0
- beta          28      22       9
- release     1970    1820    1015
- esr           68      47     121

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
            Browser  Content  Plugin
- nightly   #730
- aurora
- beta      #589     #434
- release   #9       #4
- esr       #156
status-firefox51:
--- → affected
Comment 68•7 years ago
Crash volume for signature 'js::jit::EnterBaselineMethod':
- nightly (version 52): 2 crashes from 2016-09-19.
- aurora (version 51): 0 crashes from 2016-09-19.
- beta (version 50): 41 crashes from 2016-09-20.
- release (version 49): 94 crashes from 2016-09-05.
- esr (version 45): 1728 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
- nightly        0       2
- aurora         0       0
- beta          34       7
- release       67      27
- esr          200     186

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
            Browser  Content  Plugin
- nightly
- aurora
- beta      #440     #631
- release   #975     #474
- esr       #57
status-firefox52:
--- → affected
Updated•7 years ago
Priority: -- → P3
Comment 69•7 years ago
Mass wontfix for bugs affecting firefox 52.
Updated•6 years ago
Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)]
[@ js::jit::EnterBaselineMethod] → [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)]
[@ js::jit::EnterBaselineMethod]
[@ EnterJit]
Comment 71•6 years ago
Adding this to our crash triage list.
Assignee: jdemooij → nobody
Whiteboard: [native-crash] → [native-crash][#jsapi:crashes-retriage]
Comment 72•6 years ago
Closing in favor of meta-bug Bug 858032. Current investigations branch off there.
Blocks: SadJit
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Whiteboard: [native-crash][#jsapi:crashes-retriage] → [native-crash]
Comment 73•6 years ago
If this was due (or at least partially) to bug 1281759 (I don't know, might or might not be, updated stats wouldn't hurt I guess), then I'm not sure how smart it could be to refer to a meta issue.
Comment 74•6 years ago
The signature encompasses a number of causes. There have also been numerous renames of JIT signatures, which has added to the confusion. The meta-bug should refer to Bug 1281759 as one of the sources of crashes.
Updated•2 years ago