1034706 - crash in js::jit::EnterBaselineMethod(JSContext*, js::RunState&)

Reporter

Description

•

10 years ago

This bug was filed from the Socorro interface and is 
report bp-23cc14e5-2e4f-4f96-ab95-2cf572140627.
=============================================================
0 	mozjs.dll 	js::jit::EnterBaselineMethod(JSContext *,js::RunState &) 	js/src/jit/BaselineJIT.cpp
1 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
2 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp
3 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
4 	mozjs.dll 	js_fun_apply(JSContext *,unsigned int,JS::Value *) 	js/src/jsfun.cpp
5 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
6 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
7 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp
8 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
9 	mozjs.dll 	js_fun_apply(JSContext *,unsigned int,JS::Value *) 	js/src/jsfun.cpp
10 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
11 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
12 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp
13 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
14 	mozjs.dll 	js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
15 	mozjs.dll 	JS::Call(JSContext *,JS::Handle<JS::Value>,JS::Handle<JS::Value>,JS::HandleValueArray const &,JS::MutableHandle<JS::Value>) 	js/src/jsapi.cpp
16 	xul.dll 	mozilla::dom::EventListener::HandleEvent(JSContext *,JS::Handle<JS::Value>,mozilla::dom::Event &,mozilla::ErrorResult &) 	obj-firefox/dom/bindings/EventListenerBinding.cpp
17 	xul.dll 	mozilla::dom::EventListener::HandleEvent<mozilla::dom::EventTarget *>(mozilla::dom::EventTarget * const &,mozilla::dom::Event &,mozilla::ErrorResult &,mozilla::dom::CallbackObject::ExceptionHandling) 	obj-firefox/dist/include/mozilla/dom/EventListenerBinding.h
18 	xul.dll 	mozilla::EventListenerManager::HandleEventSubType(mozilla::EventListenerManager::Listener *,nsIDOMEvent *,mozilla::dom::EventTarget *) 	dom/events/EventListenerManager.cpp
19 	xul.dll 	mozilla::EventTargetChainItem::HandleEventTargetChain(nsTArray<mozilla::EventTargetChainItem> &,mozilla::EventChainPostVisitor &,mozilla::EventDispatchingCallback *,mozilla::ELMCreationDetector &) 	dom/events/EventDispatcher.cpp
20 	xul.dll 	mozilla::EventDispatcher::Dispatch(nsISupports *,nsPresContext *,mozilla::WidgetEvent *,nsIDOMEvent *,nsEventStatus *,mozilla::EventDispatchingCallback *,nsCOMArray<mozilla::dom::EventTarget> *) 	dom/events/EventDispatcher.cpp
=============================================================
More reports: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=js%3A%3Ajit%3A%3AEnterBaselineMethod%28JSContext%2A%2C+js%3A%3ARunState%26%29

This is the same signature as a recent B2G topcrasher (bug 978450) but affects Desktop Firefox. This signature has been around on Desktop for a long time but has recently exploded on Beta by an extreme margin starting on 2014-07-02.
https://crash-analysis.mozilla.com/rkaiser/2014-07-03/2014-07-03.firefox.31.explosiveness.html

This is currently #37 across 7-days and #22 across 3-days. While not strictly a "topcrash" yet I'm marking it as such based on explosiveness.

Looking at the product correlation the volume is really high on the latest Beta compared to the previous Beta, and is really high on the latest Nightly compared to the latest Aurora.

> Firefox 31.0b6: 55.56%
> Firefox 33.0a1: 32.98%
> Firefox 32.0a2: 5.38%
> Firefox 31.0b5: 2.05%

Crashes per Install seems to indicate people are crashing here more than once:
> Firefox 31.0b6: 785 crashes per 622 installs
> Firefox 33.0a1: 466 crashes per 214 installs 

Facebook seems to be the top URL in the correlations by far.

Lawrence Mandel [:lmandel] (use needinfo)

Updated

•

10 years ago

status-firefox30: --- → unaffected

tracking-firefox31: ? → +

tracking-firefox32: ? → +

tracking-firefox33: ? → +

Nicolas B. Pierron [:nbp]

Comment 1

•

10 years ago

(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-23cc14e5-2e4f-4f96-ab95-2cf572140627.
> =============================================================
> 0 	mozjs.dll 	js::jit::EnterBaselineMethod(JSContext *,js::RunState &) 
> js/src/jit/BaselineJIT.cpp
> 1 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
> 2 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 
> =============================================================

In general, such a stack (EnterBaselineMethod) is useless: we enter generated code, so we do not know what code is being executed when these crashes happen.

> Looking at the product correlation the volume is really high on the latest
> Beta compared to the previous Beta, and is really high on the latest Nightly
> compared to the latest Aurora.
> 
> > Firefox 31.0b6: 55.56%
> > Firefox 33.0a1: 32.98%
> > Firefox 32.0a2: 5.38%
> > Firefox 31.0b5: 2.05%
> 

Changelog from Firefox 31.0b5 to Firefox 31.0b6:
http://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=a04918ac3197&tochange=9f7d43269809

Terrence, could that be related to Bug 1028358?
Anthony, can somebody from QA find a way to reproduce this issue?

Flags: needinfo?(terrence)

Flags: needinfo?(anthony.s.hughes)

u279076

Reporter

Comment 2

•

10 years ago

(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> Anthony, can somebody from QA find a way to reproduce this issue?

There's really nothing useful in any of the reports to help guide testing. Is there anything in the pushlog which stands out that we could test around?

Flags: needinfo?(anthony.s.hughes)

Terrence Cole [:terrence]

Comment 3

•

10 years ago

(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> 
> Changelog from Firefox 31.0b5 to Firefox 31.0b6:
> http://hg.mozilla.org/releases/mozilla-beta/
> pushloghtml?fromchange=a04918ac3197&tochange=9f7d43269809
> 
> Terrence, could that be related to Bug 1028358?

I don't think so. That barrier code is not used by the jits, it would only increase the live set anyway, and the crash is a null deref, not a UAF. I don't think GC is likely to be implicated here.

> Anthony, can somebody from QA find a way to reproduce this issue?

Flags: needinfo?(terrence)

Nicolas B. Pierron [:nbp]

Comment 4

•

10 years ago

(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #2)
> (In reply to Nicolas B. Pierron [:nbp] from comment #1)
> > Anthony, can somebody from QA find a way to reproduce this issue?
> 
> There's really nothing useful in any of the reports to help guide testing.

No, reports with EnterBaseline are just saying “Hey we are executing some JavaScript that we have executed more than 10 times before”.

That does not help us find the context of the failure.

> Is there anything in the pushlog which stands out that we could test around?

I looked at it, and the only commit which stands out is Bug 1028358, but Terrence replied to this hypothesis in comment 3.

The other option would be that this is something new in Facebook pages (comment 0), which is causing more failures by exposing an existing bug which might have been in the tree for a while.

Sylvestre Ledru [:Sylvestre]

Comment 5

•

10 years ago

Untracking. No activity and too late for 31

status-firefox31: affected → wontfix

tracking-firefox31: + → -

Lawrence Mandel [:lmandel] (use needinfo)

Comment 6

•

10 years ago

I'm marking as won't fix for 32 as there has been no activity. ni Naveed to help get this top crash prioritized.

status-firefox32: affected → wontfix

status-firefox34: --- → affected

tracking-firefox34: --- → +

Flags: needinfo?(nihsanullah)

Jan de Mooij [:jandem]

Updated

•

10 years ago

Flags: needinfo?(nihsanullah)

Sylvestre Ledru [:Sylvestre]

Comment 8

•

10 years ago

Jan, so, what do the stats look like? Thanks

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Comment 9

•

10 years ago

(In reply to Sylvestre Ledru [:sylvestre] from comment #8)
> Jan, so, what do the stats look like? Thanks

The fix mentioned in comment 7 helped a bit (and is in 32). But EnterBaselineMethod is still at #4 for 32, #10 for 33.

Unfortunately (top-)crashes in JIT code are not a new thing; we've had them since the first Firefox releases with a JIT. I looked at some of the crash reports recently and most of those were caused by memory corruption that's impossible to track down... It could even be code outside the JS engine that's misbehaving and corrupting our code.

I'll keep an eye on crash-stats though.

Flags: needinfo?(jdemooij)

Sylvestre Ledru [:Sylvestre]

Comment 10

•

10 years ago

OK. Thanks for the feedback.
I guess this is going to be a wontfix for 33.

Sylvestre Ledru [:Sylvestre]

Comment 11

•

10 years ago

Wontfix for 33 then.

status-firefox33: affected → wontfix

status-firefox35: --- → affected

tracking-firefox35: --- → +

Lawrence Mandel [:lmandel] (use needinfo)

Comment 12

•

10 years ago

Given comment 9, is there anything else that we can do in this bug?

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Comment 13

•

10 years ago

(In reply to Lawrence Mandel [:lmandel] from comment #12)
> Given comment 9, is there anything else that we can do in this bug?

If there's a new spike or a website that crashes reliably we'd be happy to investigate and fix it, but the current crashes look like random memory corruption and there's not much we can do.

This bug is not really actionable, so I don't know if we should track it.

Flags: needinfo?(jdemooij)

Lawrence Mandel [:lmandel] (use needinfo)

Comment 14

•

10 years ago

Kairo/Anthony - Is this still a topcrash in 33/34/35? If so, is there any more information that you can provide to assist with debugging? If not, this looks like a resolved/incomplete to me.

Flags: needinfo?(kairo)

Flags: needinfo?(anthony.s.hughes)

u279076

Reporter

Comment 15

•

10 years ago

I looked over the stats for this signature and this does not seem to qualify as a topcrash anymore, though it is still affecting some users.

> 33.0*: 90 reports
> 34.0*: 28 reports
> 35.0*: 1 report
> 36.0*: 25 reports
https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=7&range_unit=days&date=2014-10-22&signature=js%3A%3Ajit%3A%3AEnterBaselineMethod%28JSContext*%2C+js%3A%3ARunState%26%29

Flags: needinfo?(kairo)

Flags: needinfo?(anthony.s.hughes)

Keywords: topcrash-win

Lawrence Mandel [:lmandel] (use needinfo)

Comment 16

•

10 years ago

Given the data in comment 15 and the lack of additional information for debugging, I think this can likely be resolved. I want to wait until at least tomorrow to give Kairo a chance to comment.

Robert Kaiser

Comment 17

•

10 years ago

Well, if we resolve it, we might need another bug for tracking the ongoing (but probably unactionable) volume of crashes we have all the time with this signature, which in reality is probably all kinds of different things crashing *inside* baseline-compiled code.

Lawrence Mandel [:lmandel] (use needinfo)

Comment 18

•

10 years ago

I'm going to leave this open so that we have somewhere to track (per Kairo in comment 17) but am dropping tracking as this is currently not actionable.

tracking-firefox34: + → ---

tracking-firefox35: + → ---

Petruta Horea [:phorea], Desktop QA

Comment 19

•

9 years ago

This signature now affects Developer Edition 39.0a2 2015-03-30 win32 builds under Windows at start-up. The builds are unusable. 

Win64 builds are not affected under Windows. 
Linux and Mac builds can be started and used.

status-firefox39: --- → affected

Sylvestre Ledru [:Sylvestre]

Comment 20

•

9 years ago

Naveed, seems like we need your help! Could you help us with that? Thanks (this is critical as we cannot reenable 39 aurora updates).

tracking-firefox39: --- → +

Flags: needinfo?(nihsanullah)

Ryan VanderMeulen [:RyanVM]

Comment 21

•

9 years ago

Ugh, we have the same issue in automation at the moment in bug 1149377. I'm working on bisecting it now, but being pgo-only isn't helping.

Comment 22

•

9 years ago

FWIW, this bug clearly pre-dates whatever's going on with Aurora since yesterday's uplift. I think we should track the new problem over in bug 1149377 rather than this one.

Sylvestre Ledru [:Sylvestre]

Comment 23

•

9 years ago

Stop tracking this one and track bug 1149377 instead.

tracking-firefox39: + → -

Robert Kaiser

Comment 24

•

9 years ago

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #22)
> FWIW, this bug clearly pre-dates whatever's going on with Aurora since
> yesterday's uplift. I think we should track the new problem over in bug
> 1149377 rather than this one.

Yes, the signature in here is pretty much a catch-all for a class of crashes in the Baseline JIT.

Naveed Ihsanullah [:naveed]

Comment 25

•

9 years ago

nbp and jandem are working on the current issue in bug 1149377. 

For the next time we end up here: This stack by itself (and therefore this specific bug) is not really actionable. It may imply a code generation problem or an exception occurred while processing warm JS. Bisection or another hint will probably be needed to work the issue and a more specific bug should be opened.

Flags: needinfo?(nihsanullah)

alex_mayorga

Comment 26

•

9 years ago

(In reply to Naveed Ihsanullah [:naveed] from comment #25)
> nbp and jandem are working on the current issue in bug 1149377. 
> 
> For the next time we end up here: This stack by itself (and therefore this
> specific bug) is not really actionable. It may imply a code generation
> problem or an exception occurred while processing warm JS. Bisection or
> another hint will probably be needed to work the issue and a more specific
> bug should be opened.

¡Hola Naveed!

FWIW I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=1200685

Hope it is useful; otherwise let me know and I'll close it =)

¡Gracias!

Flags: needinfo?(nihsanullah)

Naveed Ihsanullah [:naveed]

Comment 27

•

9 years ago

I'll pass the bug on to Jan. I don't see any additional actionable information in that bug but perhaps Jan can tell more. 

Jan, can we instrument the code for this class of crashes so more information is available to us in the crash reports?

Flags: needinfo?(nihsanullah)

Jan de Mooij [:jandem]

Updated

•

9 years ago

Flags: needinfo?(jdemooij)

BMO Automation

Updated

•

9 years ago

Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] → [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] [@ js::jit::EnterBaselineMethod]

Tobias B. Besemer [:BesTo] (QA)

Comment 29

•

9 years ago

"Assignee:" taken over from Bug 1200685.

Assignee: nobody → jdemooij

Blocks: shutdownkill

status-firefox41: --- → ?

status-firefox42: --- → ?

status-firefox43: --- → affected

status-firefox44: --- → ?

status-firefox45: --- → ?

Whiteboard: ShutDownKill

Tobias B. Besemer [:BesTo] (QA)

Comment 31

•

9 years ago

From Bug 956980 ...

Summary: crash in js::jit::EnterBaselineMethod(JSContext*, js::RunState&) mostly with cached documents

https://bugzilla.mozilla.org/show_bug.cgi?id=956980#c0
(In reply to Kevin Brosnan [:kbrosnan] from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-77126f40-348c-46eb-9f74-79c772140106.
> =============================================================
> 
> Nothing useful in comments. Almost all the URLs have wyciwyg which suggests
> the documents were retrieved from the cache. Wired URLs represent 10 out of
> the 13 submitted URLs.
> 
> wyciwyg://0/http://www.wired.com/opinion/2013/11/so-the-internets-about-to-
> lose-its-net-neutrality/
> 
> wyciwyg://0/http://www.wired.com/opinion/2012/11/cease-and-desist-manuals-
> planned-obsolescence/
> 
> There are two non-cache URLs and those are
> 
> http://www.photoprikol.net/photo/138-igrushki-sssr-72-foto.html
> 
> https://www.facebook.com/

Whiteboard: ShutDownKill → [native-crash], ShutDownKill

Tobias B. Besemer [:BesTo] (QA)

Comment 32

•

9 years ago

+ Emails from the dups ...

Jan de Mooij [:jandem]

Comment 33

•

9 years ago

(In reply to Naveed Ihsanullah [:naveed] from comment #27)
> I'll pass the bug on to Jan. I don't see any additional actionable
> information in that bug but perhaps Jan can tell more. 
> 
> Jan, can we instrument the code for this class of crashes so more
> information is available to us in the crash reports?

Yeah these crashes aren't really actionable. JIT crashes are caused by different bugs and many of the reports are random memory corruption. We want to hear about spikes and reproducible cases though.

Making JIT code non-writable may help us catch memory corruption bugs sooner/elsewhere. That's bug 1215479 but it's pretty hard to do without regressing performance.

Flags: needinfo?(jdemooij)

Nicolas B. Pierron [:nbp]

Comment 34

•

9 years ago

(In reply to Jan de Mooij [:jandem] from comment #33)
> Making JIT code non-writable may help us catch memory corruption bugs
> sooner/elsewhere. That's bug 1215479 but it's pretty hard to do without
> regressing performance.

Could we only re-protect the code 1/10th of the time?  Thus, amortize the cost of protecting the pages, and potentially catch some of these other issues without huge performance regressions, while providing a better crash-stack.

Ciprian Muresan [:cmuresan], Ecosystem QA

Comment 35

•

8 years ago

From the crash signature [@ js::jit::EnterBaselineMethod ], the affected versions are:
- Nightly: 47
- Aurora: 46, 45
- Beta: 45.0b1, 45.0b2, 44.0b99, 44.0b1, 44.0b9, 44.0b8, 44.0b6, 44.0b2, 44.0b7

In the crash signature [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&) ] there are no reports in the last 28 days.

Biru [:poiru]

Updated

•

8 years ago

Blocks: e10s-crashes

Brad Lassey [:blassey] (use needinfo?)

Updated

•

8 years ago

tracking-e10s: --- → ?

Tracy Walker [:tracy]

Comment 36

•

8 years ago

Currently, for the past 7 days, there are 2800 crashes reported for beta and only 12 reported on nightly for [@ js::jit::EnterBaselineMethod]

Nicolas B. Pierron [:nbp]

Comment 37

•

8 years ago

important

(In reply to [:tracy] Tracy Walker from comment #36)
> Currently, for the past 7 days, there are 2800 crashes reported for beta and
> only 12 reported on nightly for [@ js::jit::EnterBaselineMethod]

As mentioned throughout this bug, this signature is not actionable.
To investigate such issues, here are some of the fastest ways forward:
 - Reproduce the issue with one of the reported URLs.
 - List all backported patches since the last version. (comment 25)
 - Find an actionable existing bug which shows the same crash characteristics (crash address, stack pointer, …).

Without any of this information, I would not expect any investigation from the JS Team, as we are likely to harm our users more with random urgent fixes.

Jim Mathies [:jimm]

Comment 38

•

8 years ago

This is a generic crash that doesn't appear to afflict e10s more or less than non-e10s. Untracking.

45.0b6 content process crashes - 228
45.0b6 crashes with e10s disabled - 2165
The percentage of beta users running e10s during our experiment was about 10%.
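
The comparison behind that conclusion can be written out as a quick check. This is only a sketch using the two counts and the roughly 10% figure quoted above; the variable names are illustrative, and it assumes the two counts are directly comparable.

```python
# Counts quoted above for 45.0b6.
content_process_crashes = 228   # crashes in e10s content processes
non_e10s_crashes = 2165         # crashes with e10s disabled
e10s_user_share = 0.10          # approximate share of beta users running e10s

# Share of these crashes that came from e10s content processes.
e10s_crash_share = content_process_crashes / (
    content_process_crashes + non_e10s_crashes
)

print(f"e10s share of crashes: {e10s_crash_share:.1%}")
print(f"e10s share of users:   {e10s_user_share:.1%}")
```

The crash share comes out at about 9.5%, close to the roughly 10% of users on e10s, which is why the signature does not look over- or under-represented there.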

tracking-e10s: ? → -

Kan-Ru Chen [:kanru] (UTC+9)

Updated

•

8 years ago

No longer blocks: e10s-crashes

Jim Mathies [:jimm]

Comment 39

•

8 years ago

Looking at beta 46 (5, 6, 7) experiment crash data, this shows up twice as often under e10s. It is also the #8 top crasher.

Blocks: e10s-crashes

Jim Mathies [:jimm]

Comment 40

•

8 years ago

Jan, any suggestions here on how to proceed with this under e10s?

https://crash-stats.mozilla.com/search/?product=Firefox&version=46.0b7&version=46.0b6&version=46.0b5&dom_ipc_enabled=!__null__&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Comment 41

•

8 years ago

(In reply to Jim Mathies [:jimm] from comment #40)
> Jan, any suggestions here on how to proceed with this under e10s?

I looked at some of these beta crash dumps and it's the usual mix. Most common are:

* Valid JIT code but some invalid bytes in the middle. This is pretty weird and code corruption should be less likely now with W^X (also in 46). I'm currently tracking down a related crash in bug 1124397. That's probably some other thread misbehaving. I'll continue working on that one.

* Valid JIT code but reading/writing invalid memory. JIT code accesses a lot of things and this is probably similar to the GC topcrashes we have.

* Some crashes remind me of bug 1260721. I'll see what we can do there.

Unfortunately most of these look like random memory corruption. If these crashes are worse with e10s, maybe we have some heap corruption bugs there?

Flags: needinfo?(jdemooij)

jd2978@outlook.com

Comment 42

•

8 years ago

Still a current problem. Firefox crashes in less than 2 mins after opening. Open a new tab, open a browser. Many times, when the page attempts to render, it crashes. About 5 different crash reasons. Says not a plugin crash

Liz Henry (:lizzard) (relman/hg->git project)

Comment 43

•

8 years ago

#2 topcrash on 46 release right now (pretty high volume, just under OOM crashes). e10s should be disabled on release. People are complaining that they are hitting the crash after updating. The crash spike may also be correlated with AV software (see bug 1268025)

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Comment 44

•

8 years ago

Today I looked at about 80 crash dumps for EnterBaselineMethod crashes (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.

Here are the largest buckets:

-----

(1) At least 15-20% of these crashes are with our notorious "AuthenticAMD family 20 model 2 stepping 0" CPU. These crashes are all similar: we're executing the following Baseline type monitor stub:

  cmp    $0xffffff88,%ecx
  jne    L
  cmp    %edx,0x10(%edi)
  jne    L
  ret    
 L:
  mov 0x4(%edi),%edi
  jmp    *(%edi)

The first instruction is the one where we crash (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a low address like 0x168). Yes, that makes no sense: this compare instruction does not access any memory.

I don't see crashes in this code with any other CPU. It's not the first time this processor is causing trouble, see bug 772330 and also bug 1264188 (although the latter is mostly model 1 and this is model 2). I wonder if this could be erratum 688 or a similar bug - Baseline stubs definitely use a lot of indirect jumps and calls.

Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501

Not sure what we should do here - we could try to emit some NOPS between the jumps and see if that helps...

-----

(2) At least 8% (7 reports) are caused by a single bit flip in ICEntry pointers in Baseline code. Baseline code calls into ICs for most bytecode ops, so a typical Baseline script has sequences of:

 mov    $0x6675cbcc,%edi <- ICEntry 1
 mov    (%edi),%edi
 call   *(%edi)
 ..
 mov    $0x6675cbd8,%edi <- ICEntry 2
 mov    (%edi),%edi
 call   *(%edi)
 ..
 mov    $0x6675cae4,%edi <- ICEntry 3
 mov    (%edi),%edi
 call   *(%edi)          <== crash

Notice that there are 12 bytes (that's sizeof(ICEntry) on x86) between ICEntry 1 and ICEntry 2. ICEntry 3 is bogus: it should be 0x6675cbe4 but it is 0x6675cae4 -- 1 bit was flipped.

These bit flips in ICEntry pointers are surprisingly common. We should probably add checks for this. Not sure what else we can do.

(This particular crash is bp-4a6a05ac-f0b7-4f75-b41f-50fbf2160501.)
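
The stride argument can be sketched as a small post-mortem check on pointers read out of a crash dump. This is an illustration, not code from the engine: `find_bit_flipped_entries` is a hypothetical helper, and it assumes the first pointer is intact and that consecutive ICEntry pointers are exactly 12 bytes apart, as in the excerpt above.

```python
ICENTRY_SIZE = 12  # sizeof(ICEntry) on x86, per the analysis above


def find_bit_flipped_entries(pointers, stride=ICENTRY_SIZE):
    """Return (index, observed, expected, differing_bits) for pointers off the stride.

    Assumes pointers[0] is intact and that entry i should sit at
    pointers[0] + i * stride.
    """
    suspects = []
    base = pointers[0]
    for index, observed in enumerate(pointers):
        expected = base + index * stride
        diff = observed ^ expected
        if diff:
            # Count the bits that differ between observed and expected.
            suspects.append((index, observed, expected, bin(diff).count("1")))
    return suspects


# ICEntry pointers from the crashing script quoted above.
observed_pointers = [0x6675CBCC, 0x6675CBD8, 0x6675CAE4]

for index, observed, expected, bits in find_bit_flipped_entries(observed_pointers):
    print(f"ICEntry {index + 1}: saw {observed:#x}, expected {expected:#x}, "
          f"{bits} bit(s) differ")
```

A mismatch that differs from the expected value in exactly one bit is what points at a bit flip rather than, say, a wrong entry being referenced.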

-----

(3) At least 15% (13 reports) are bit flips in JIT code (either instructions or labels), for instance:

- Exhibit 1: bp-2639a76f-172c-47d2-81b4-a01162160501

 cmp    $0x1000000,%ebx
 jb     0x11a7f4e2
 cmp    $0xffffff88,%ecx
 jne    0x11a7f4d2

This is part of a post barrier in JIT code. The second jump offset should be the same as the first jump, but a bit was flipped so instead it jumps in the middle of an instruction.

(At 0x11a7f4d2 we have a 0xfb byte, that's an STI instruction that's invalid in user mode, so we crash with EXCEPTION_PRIV_INSTRUCTION.)

- Exhibit 2: bp-6a2f6ba3-7ac7-4737-a4c9-d21542160503

 1e91016:	bf e0 20 85 0a       	mov    $0xa8520e0,%edi
 1e9101b:	8b 3f                	mov    (%edi),%edi
 1e9101d:	ff 17                	call   *(%edi)

 1e9101f:	bf ec 20 85 0a       	mov    $0xa8520ec,%edi
 1e91024:	8b 3f                	mov    (%edi),%edi
 1e91026:	ff 1f                	lcall  *(%edi)

The last instruction is where we crash: a bitflip (0x17 -> 0x1f) corrupted a call instruction ("lcall" makes no sense).

There are many similar bitflips.

-----

(4) At least 14% (12 reports) are EXCEPTION_ACCESS_VIOLATION_EXEC while trying to execute memory that doesn't look like JIT code. Likely random pages that we attempt to execute because of a bug somewhere.

Many of these are probably caused by bit flips (the previous 2 categories) and we happened to end up in mapped memory instead of crashing immediately.

-----

These 4 buckets cover about 50% or so. The remaining crashes are harder to categorize, but I think a good chunk of them are caused by similar memory corruption.

I did see some crashes where we have for instance a Value with object type tag and nullptr payload, but because there are so few of them it's not clear what's going on.

Flags: needinfo?(jdemooij)

Nathan Froyd [:froydnj]

Comment 45

•

8 years ago

(In reply to Jan de Mooij [:jandem] from comment #44)
> Today I looked at about 80 crash dumps for EnterBaselineMethod crashes
> (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.
> 
> Here are the largest buckets:
> 
> -----
> 
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
> 
>   cmp    $0xffffff88,%ecx
>   jne    L
>   cmp    %edx,0x10(%edi)
>   jne    L
>   ret    
>  L:
>   mov 0x4(%edi),%edi
>   jmp    *(%edi)
> 
> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.
> 
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.
> 
> Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501
> 
> Not sure what we should do here - we could try to emit some NOPS between the
> jumps and see if that helps...

This sounds kind of similar to bug 772330 comment 22, where dmajor describes an AMD CPU erratum.  The erratum is in this doc:

http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The description says:

"Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the branch status when a taken branch occurs where the first or second instruction after the branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that it appears the processor skips, and does not execute, one or more instructions. The new updated rIP due to this erratum may not be at an instruction boundary"

It's not a great matchup, but if these sorts of crashes are *all* on AMD chips, the above might be plausible...
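
One part of that description, "a not-taken branch that ends on the last byte of an aligned quad-word", can be turned into a simple address check. The sketch below models only that alignment condition and none of the timing or indirect-jump conditions. It assumes the stub uses the short encodings (3-byte `cmp`, 2-byte `jne`), so the two `jne` instructions sit at offsets 3 and 8; the helper names are made up for illustration, and the stub's real load address is unknown.

```python
QUADWORD = 8  # bytes; AMD documentation uses "quad-word" for a 64-bit quantity


def ends_on_quadword_last_byte(address, length):
    """True if an instruction of `length` bytes at `address` ends on byte 7
    of an aligned 8-byte block."""
    last_byte = address + length - 1
    return last_byte % QUADWORD == QUADWORD - 1


# (offset within the stub, length in bytes) of the two `jne` instructions,
# assuming the 2-byte rel8 encoding.
JNE_INSTRUCTIONS = [(3, 2), (8, 2)]


def risky_stub_alignments():
    """Stub start addresses (mod 8) at which some `jne` ends on the last
    byte of an aligned quad-word."""
    return sorted(
        base
        for base in range(QUADWORD)
        if any(
            ends_on_quadword_last_byte(base + offset, length)
            for offset, length in JNE_INSTRUCTIONS
        )
    )


print(risky_stub_alignments())
```

Under those assumptions only stub start addresses congruent to 3 or 6 modulo 8 put a `jne` in that position, so if the alignment condition matters, comparing the stub addresses in the crash dumps against those residues would be one way to test the theory.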

Nathan Froyd [:froydnj]

Comment 46

•

8 years ago

...and if I had read more closely, I would have seen that you referenced that exact bug and erratum. =/

Luke Wagner [:luke]

Comment 47

•

8 years ago

Random idea:

What if we had a system that allocated a few scattered MiB (i.e., not all in one contiguous run or always at the same address, though being careful not to unduly increase fragmentation) with a predictable bit-pattern and periodically (say, on the daily telemetry or update ping) the system scanned all those MiBs to ensure they still had the same bit-pattern.  If a bitflip was detected, we set a flag in the browser that gets included in crash reports and also persists between browser restarts (at least for a period of time).

This could help us confirm a correlation between these catch-all JIT/GC crashes and the corruption flag and also have separate bins so that spikes in non-corruption-correlated crashes get more attention.  If we wanted to get fancy, we could even pop up a notification to the user suggesting they have bad RAM if they had the corruption flag and they were experiencing crashes :)
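
The scanning logic could look roughly like the sketch below. It only illustrates the bookkeeping: the fill pattern, region size, and region count are arbitrary assumptions, Python gives no control over where the buffers land in physical memory, and a real version would live in native browser code and feed a persisted flag into crash reports.

```python
PATTERN_BYTE = 0xA5          # hypothetical fill pattern
REGION_SIZE = 1024 * 1024    # 1 MiB per region
REGION_COUNT = 4             # "a few" regions; the exact number is an assumption


def allocate_sentinels(count=REGION_COUNT, size=REGION_SIZE):
    """Allocate separate buffers filled with a predictable bit pattern."""
    return [bytearray([PATTERN_BYTE]) * size for _ in range(count)]


def scan_sentinels(regions):
    """Return (region_index, offset, flipped_bit_mask) for every changed byte."""
    expected = bytes([PATTERN_BYTE])
    corrupted = []
    for region_index, region in enumerate(regions):
        if region.count(expected) == len(region):
            continue  # fast path: every byte still matches the pattern
        for offset, value in enumerate(region):
            if value != PATTERN_BYTE:
                corrupted.append((region_index, offset, value ^ PATTERN_BYTE))
    return corrupted


regions = allocate_sentinels()
regions[2][12345] ^= 0x10    # simulate a single bit flip for demonstration

findings = scan_sentinels(regions)
corruption_flag = bool(findings)  # the flag that would go into crash reports
print(corruption_flag, findings)
```

Recording the flipped-bit mask as well as the location would make it possible to tell isolated single-bit flips apart from larger overwrites when correlating with crash reports.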

Andrew McCreight [:mccr8]

Comment 48

•

8 years ago

This isn't a shutdown crash as far as I can tell.

No longer blocks: shutdownkill

Whiteboard: [native-crash], ShutDownKill → [native-crash]

Andrew McCreight [:mccr8]

Updated

•

8 years ago

Depends on: 772330

Nicolas B. Pierron [:nbp]

Comment 49

•

8 years ago

(In reply to Jan de Mooij [:jandem] from comment #44)
> Today I looked at about 80 crash dumps for EnterBaselineMethod crashes
> (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.

Awesome work!

> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.

Can we detect this CPU family, and use the segfault handler to resume execution?  In a similar way to how operating systems emulate old instructions on newer generations of CPUs.

> […] are caused by a single bit flip […]

Luke's suggestion sounds interesting.  I recall people mentioning doing a memcheck as part of the safe-mode.

I do not know what cost this would have, but maybe this is something we can (randomly) do when we allocate new memory pages.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 50

•

8 years ago

(In reply to Jan de Mooij [:jandem] from comment #44)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
...
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.

My experience with the AMD crashes has been that some of them show up mostly on model 1, and others show up mostly on model 2.

> Not sure what we should do here - we could try to emit some NOPS between the
> jumps and see if that helps...

That could actually be very interesting, especially if you understand what the alignment conditions described in the erratum are referring to.  (alignment of what?)  It seems like it might be entirely possible to make these crashes go away for JIT-generated code by changing the alignment of jumps.

> These bit flips in ICEntry pointers are surprisingly common. We should
> probably add checks for this. Not sure what else we can do.

Yeah, I've seen a bunch of other crashes recently that were the result of bit flips in memory.  Though I'm curious if it's consistent with random memory being bad for such a high proportion of such crashes to end up at this particular spot.  Are there a large number of these pointers?   (For the bit flips being in JIT code... it seems more obvious to me that that's a big use of memory.)

Nicholas Nethercote [inactive]

Comment 51

•

8 years ago

Thank you for the detailed analysis, Jan.

The bitflips are scary. Jan, are you assuming that it's faulty hardware that's the cause? It sounds like others are assuming that but I can't tell if that's what you think.

I think running a memtest in certain circumstances is a great idea. How hard is it to write a memtest? What circumstances would you run it under? Do we have a bug open for this idea?

Luke Wagner [:luke]

Comment 52

•

8 years ago

I remember dolske was experimenting with running a memtest a few years ago; I don't know what happened there.  I don't have any bugs on file -- just an idea while reading Jan's very interesting analysis -- sorry, I don't mean to derail the more targeted discussion here.

Zack Weinberg (:zwol)

Comment 53

•

8 years ago

The mysterious category 1 could also be bit flips.  This is the machine code for the troublesome fragment:

   0:	83 f9 88             	cmp    $0xffffff88,%ecx
   3:	75 06                	jne    b <L>
   5:	39 57 10             	cmp    %edx,0x10(%edi)
   8:	75 01                	jne    b <L>
   a:	c3                   	ret    
<L>

I generated all possible one-bit flips of the first instruction.  One possibility stands out: if the first byte becomes A3, then the CPU sees

   0:	a3 f9 88 75 06       	mov    %eax,0x67588f9
   5:	39 57 10             	cmp    %edx,0x10(%edi)
   8:	75 01                	jne    b <L>
   a:	c3                   	ret    

which performs a write to memory at an address that's almost certainly inaccessible.  It's not a _low_ address, though.  It takes three bit flips to hit an instruction that could plausibly write to a low address:

   0:	89 79 88             	mov    %edi,-0x78(%ecx)
   3:	75 06                	jne    b <L>
   5:	39 57 10             	cmp    %edx,0x10(%edi)
   8:	75 01                	jne    b <L>
   a:	c3                   	ret    

Still, what with all the other cases seeming to be memory corruption, I would suggest that this is more probable than a CPU bug.

Zack Weinberg (:zwol)

Comment 54

•

8 years ago

Another one-bit flip possibility that I missed earlier:

   0:	83 b9 88 75 06 39 57 	cmpl   $0x57,0x39067588(%ecx)
   7:	10 75 01             	adc    %dh,0x1(%ebp)
   a:	c3                   	ret    

That could hit a low address depending on what's in %ecx.  (What _is_ in %ecx?)
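
The enumeration described in comments 53 and 54 can be sketched as follows. This is only an illustration, not the script that was actually used: the helper name `single_bit_flips` is hypothetical, and the byte string is copied from the AT&T listing in comment 53. The script only produces the candidate byte sequences; each one still has to be fed to a disassembler to see what the CPU would decode.

```python
# Bytes of the stub up to and including the ret, as listed in comment 53.
STUB = bytes.fromhex("83f98875063957107501c3")


def single_bit_flips(code, length):
    """Yield (byte_index, bit_index, mutated_bytes) for every one-bit flip
    within the first `length` bytes of `code`. Hypothetical helper."""
    for i in range(length):
        for bit in range(8):
            mutated = bytearray(code)
            mutated[i] ^= 1 << bit
            yield i, bit, bytes(mutated)


# The first instruction (cmp $0xffffff88,%ecx) is 3 bytes long,
# so there are 3 * 8 = 24 candidates to disassemble.
flips = list(single_bit_flips(STUB, 3))
for i, bit, mutated in flips:
    print(f"byte {i} bit {bit}: {mutated.hex(' ')}")
```

Among the 24 candidates are the two mentioned above: the sequence starting with `a3 f9 88 75 06` (bit 5 of the first byte flipped) and the one starting with `83 b9 88 75 06` (bit 6 of the second byte flipped).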

Terrence Cole [:terrence]

Updated

•

8 years ago

Comment 55

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #51)
> Thank you for the detailed analysis, Jan.
> 
> The bitflips are scary. Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.
> 
> I think running a memtest in certain circumstances is a great idea. How hard
> is it to write a memtest? What circumstances would you run it under? Do we have
> a bug open for this idea?

Bug 995652 is our memtest on crash bug. I've filed bug 1270554 to work on memtest in the running firefox process.

Terrence Cole [:terrence]

Updated

•

8 years ago

Comment 56

•

8 years ago

Thanks for all the comments. Replies below.

(In reply to Nicolas B. Pierron [:nbp] from comment #49)
> Can we detect this CPU family, and use the segfault handler to resume the
> execution?  In a similar way as operating systems emulate old
> instructions on newer generations of CPUs.

Interesting idea but it seems complicated, also because we don't really know the state of the CPU when it misbehaves.

(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #50)
> > Not sure what we should do here - we could try to emit some NOPS between the
> > jumps and see if that helps...
> 
> That could actually be very interesting, especially if you understand what
> the alignment conditions described in the erratum are referring to. 
> (alignment of what?)  It seems like it might be entirely possible to make
> these crashes go away for JIT-generated code by changing the alignment of
> jumps.

Yeah I think as a first step we could try to emit NOPS as part of this particular IC stub, and see if it makes these crashes go away.

> Though I'm curious if it's consistent with random
> memory being bad for such a high proportion of such crashes to end up at
> this particular spot.  Are there a large number of these pointers?

Yes, basically one for each interesting JS bytecode op. Also, we first emit the code and then at the end we write these pointers in it (once we know the values), so it's possible that write pattern happens to hit memory or cache lines in a way that makes it more error prone.

(In reply to Nicholas Nethercote [:njn] from comment #51)
> Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.

I think so, yeah. In theory it could be another thread doing something like *bytePtr ^= 0x1, but that also seems unlikely. Our JIT code is usually non-writable so the window for this is pretty small. Also, on Twitter people from the Chrome/V8 teams said they've seen similar bitflips.

(In reply to Zack Weinberg (:zwol) from comment #53)
> The mysterious category 1 could also be bit flips.

That's not what I'm seeing in the memory dumps. Or do you mean a different kind of bitflip, somewhere in the CPU?

> Still, what with all the other cases seeming to be memory corruption, I
> would suggest that this is more probable than a CPU bug.

Also if it's (a) *only* this exact CPU, and (b) *always* this particular piece of JIT code and (c) this CPU is *known* to be buggy when it comes to (indirect) branches?

Jim Blandy :jimb

Comment 57

•

8 years ago

It might be easy and interesting to compute a checksum of each block of machine code, and then check it before entry. Checksums can be pretty fast.

Jim Blandy :jimb

Comment 58

•

8 years ago

... *especially* checksums that need to detect only a single bit changing, without correction. We could do giant SSE xors, 128 bits at a time, over the code. It'd be on the scale of a memset or memcpy operation.
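
A minimal sketch of the XOR idea, in Python rather than SSE for readability. The function name `xor_checksum` and the 16-byte chunk width are assumptions chosen to mirror "128 bits at a time"; nothing here reflects actual SpiderMonkey code. The checksum would be computed when the code is emitted and compared again before entry.

```python
def xor_checksum(code: bytes, width: int = 16) -> int:
    """XOR-fold `code` in `width`-byte chunks (16 bytes = 128 bits).
    The last chunk is zero-padded. Hypothetical helper."""
    acc = 0
    for off in range(0, len(code), width):
        chunk = code[off:off + width].ljust(width, b"\x00")
        acc ^= int.from_bytes(chunk, "little")
    return acc


# Stand-in for a block of JIT code: the stub bytes repeated a few times.
code = bytes.fromhex("83f98875063957107501c3") * 4
expected = xor_checksum(code)

# Simulate a single bit flip in the first byte (0x83 -> 0xa3).
corrupted = bytearray(code)
corrupted[0] ^= 0x20
detected = xor_checksum(bytes(corrupted)) != expected
print("corruption detected:", detected)
```

Any single flipped bit changes exactly one bit of the accumulator, so it is always detected. Two flips at the same bit position in different chunks cancel out, which is the price of not using a stronger checksum.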

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 59

•

8 years ago

(In reply to Jan de Mooij [:jandem] from comment #44)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
> 
>   cmp    $0xffffff88,%ecx
>   jne    L
>   cmp    %edx,0x10(%edi)
>   jne    L
>   ret    
>  L:
>   mov 0x4(%edi),%edi
>   jmp    *(%edi)
> 
> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.

FWIW, this hasn't been a characteristic of the other crashes we've seen with bug 772330.  I believe for those, it's made sense how we would have crashed at the given instruction given that we ended up there in the state we were in.

> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.
> 
> Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501

I looked at this one a little bit, and I really don't see how we crashed.  The JIT code was:

06EF5C50 83 F9 88             cmp         ecx,0FFFFFF88h  
06EF5C53 0F 85 0A 00 00 00    jne         06EF5C63  
06EF5C59 39 57 10             cmp         dword ptr [edi+10h],edx  
06EF5C5C 0F 85 01 00 00 00    jne         06EF5C63  
06EF5C62 C3                   ret

I wonder if there's a way to transform that into something that reads from EBX with a bit flip.  (I mention EBX because the crash address is 0x80, which is the value of EBX.)  (The closest I see is 8B 3B, which is four bit flips!)
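
Counting how many bit flips separate two byte sequences is a Hamming distance over the bytes. A small sketch (the helper name `bit_distance` is hypothetical) that checks the counts quoted in this bug:

```python
def bit_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = bytes.fromhex("83f9")  # first two bytes of the cmp instruction

# Candidate from this comment: 8B 3B.
d_8b3b = bit_distance(original, bytes.fromhex("8b3b"))
# Candidate from comment 53: 89 79 (mov %edi,-0x78(%ecx)).
d_8979 = bit_distance(original, bytes.fromhex("8979"))

print(d_8b3b, d_8979)
```

This gives 4 for `8B 3B` and 3 for `89 79`, in line with the "four bit flips" above and the "three bit flips" in comment 53.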

Jan de Mooij [:jandem]

Comment 60

•

8 years ago

(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) (busy May 9-13) from comment #59)
> I wonder if there's a way to transform that into something that reads from
> EBX with a bit flip.  (I mention EBX because the crash address is 0x80,
> which is the value of EBX.)

I looked at some other reports and the crash address is often the value in either EAX or EBX. Another erratum that might be relevant here:

> 578. Branch Prediction May Cause Incorrect Processor Behavior
> 
> Under a highly specific and detailed set of internal timing conditions involving
> multiple events occurring within a small window of time, the processor branch
> prediction logic may cause the processor core to decode incorrect instruction
> bytes.
> 
> Potential Effect on System
> 
> Unpredictable program behavior, generally leading to a program exception.

I think "decode incorrect instruction bytes" fits these crashes really well. This issue has been fixed; it's possible to check CPUID bits to see if the processor has the fix.

Unfortunately there's not much information to go on so it's just guessing at this point.

Liz Henry (:lizzard) (relman/hg->git project)

Comment 61

•

8 years ago

For [@ js::jit::EnterBaselineMethod ], it's the #10 topcrash on release for 46.0.1.  It looks fairly high volume on 47 beta 4 and beta as well. It doesn't show up much in 49 or 48. Is there something new going on? Does this seem actionable at all?   

Till we know, I'll track this for 46 and 47.

status-firefox46: --- → affected

status-firefox47: --- → affected

status-firefox48: --- → affected

tracking-firefox46: --- → +

tracking-firefox47: --- → +

Flags: needinfo?(jdemooij)

Jan de Mooij [:jandem]

Comment 62

•

8 years ago

(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #61)
> For [@ js::jit::EnterBaselineMethod ], it's the #10 topcrash on release for
> 46.0.1.  It looks fairly high volume on 47 beta 4 and beta as well. It
> doesn't show up much in 49 or 48. Is there something new going on? Does this
> seem actionable at all?   
> 
> Till we know, I'll track this for 46 and 47.

I looked at this a bit and I don't think the beta crashes are very different from release.

There's at least 1 user (on XP SP2) who submitted a pretty large number of beta crash reports. They don't look very actionable though, maybe malware or bad hardware.

Flags: needinfo?(jdemooij)

Nicholas Nethercote [inactive]

Comment 63

•

8 years ago

> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU.
> [...]
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.

I did some follow-up analysis on this.

TL;DR: The following CPU families have suspiciously high EnterBaselineMethod
crash rates. They are ranked from the most crashes to the least. 

> Cpu Info                                        Count
> AuthenticAMD family 16 model 6 stepping 3 | 2   18341
> AuthenticAMD family 20 model 2 stepping 0 | 2   13663
> AuthenticAMD family 22 model 0 stepping 1 | 2   6471
> AuthenticAMD family 21 model 19 stepping 1 | 2  5894
> AuthenticAMD family 16 model 6 stepping 3 | 1   233
> AuthenticAMD family 20 model 1 stepping 0 | 2   143
> AuthenticAMD family 6 model 8 stepping 1 | 1    102
> AuthenticAMD family 22 model 0 stepping 1 | 4   78

Note especially the many crashes outside of "family 20"! Perhaps AMD bugs are
more widespread than we thought? Jan, it might be worth looking at JIT crashes
in these other families.

----

Details:

I did a super search for all Firefox crashes in the past 7 days, faceted on the
"cpu info" field:

https://crash-stats.mozilla.com/search/?product=Firefox&_facets=signature&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

I clicked on all 50 entries, opening a new tab for each one. Each tab thus held
all the Firefox crashes in the past 7 days for a single CPU family. I then went
through them all and manually extracted the rank and percentage for
EnterBaselineMethod crashes, resulting in the following table.

> Rank    Cpu info                                        Count   %       EnterBaselineMethod rank/%
> -       ALL FAMILIES COMBINED                                           #14 0.73 %
> 1       GenuineIntel family 6 model 23 stepping 10 | 2  137092  9.91 %  #10 0.92 %
> 2       GenuineIntel family 6 model 42 stepping 7 | 4   101213  7.32 %  #32 0.40 %
> 3       GenuineIntel family 6 model 58 stepping 9 | 4   98764   7.14 %  #42 0.30 %
> 4       GenuineIntel family 6 model 60 stepping 3 | 4   62630   4.53 %  (not in top 50)
> 5  *    GenuineIntel family 6 model 15 stepping 13 | 2  59451   4.30 %  #10 1.24 %
> 6       GenuineIntel family 6 model 42 stepping 7 | 2   47232   3.41 %  #24 0.47 %
> 7       GenuineIntel family 6 model 69 stepping 1 | 4   44410   3.21 %  (not in top 50)
> 8       GenuineIntel family 6 model 37 stepping 5 | 4   39198   2.83 %  #28 0.41 %
> 9       GenuineIntel family 6 model 58 stepping 9 | 2   34429   2.49 %  #24 0.46 %
> 10      ???                                             27544   1.99 %  ???
> 11      GenuineIntel family 6 model 61 stepping 4 | 4   25239   1.82 %  (not in top 50)
> 12      family 6 model 69 stepping 1 | 4                24855   1.80 %  #27 0.15 %
> 13      GenuineIntel family 6 model 60 stepping 3 | 8   23132   1.67 %  (not in top 50)
> 14      GenuineIntel family 6 model 23 stepping 6 | 2   20895   1.51 %  #15 0.75 %
> 15 ***  AuthenticAMD family 16 model 6 stepping 3 | 2   18341   1.33 %  #3  3.79 %
> 16      family 6 model 58 stepping 9 | 4                17604   1.27 %  #16 0.66 %
> 17      family 6 model 42 stepping 7 | 4                17084   1.23 %  #38 0.36 %
> 18      GenuineIntel family 6 model 37 stepping 2 | 4   16006   1.16 %  #28 0.46 %
> 19      GenuineIntel family 6 model 58 stepping 9 | 8   15807   1.14 %  #41 0.33 %
> 20      GenuineIntel family 6 model 60 stepping 3 | 2   14002   1.01 %  #30 0.36 %
> 21 ***  AuthenticAMD family 20 model 2 stepping 0 | 2   13663   0.99 %  #1  4.50 %
> 22 *    GenuineIntel family 6 model 15 stepping 11 | 2  13360   0.97 %  #9  1.16 %
> 23      family 6 model 23 stepping 10 | 2               13193   0.95 %  #17 0.75 %
> 24 *    AuthenticAMD family 16 model 6 stepping 2 | 2   12956   0.94 %  #4  1.52 %
> 25      GenuineIntel family 6 model 42 stepping 7 | 8   12083   0.87 %  #21 0.53 %
> 26      GenuineIntel family 6 model 15 stepping 2 | 2   10930   0.79 %  #13 0.91 %
> 27 *    GenuineIntel family 15 model 6 stepping 5 | 2   9988    0.72 %  #7  1.36 %
> 28      GenuineIntel family 6 model 15 stepping 6 | 2   9380    0.68 %  #13 0.81 %
> 29      GenuineIntel family 6 model 55 stepping 8 | 2   9241    0.67 %  (not in top 50)
> 30      GenuineIntel family 6 model 55 stepping 8 | 4   8767    0.63 %  (not in top 50)
> 31      family 6 model 58 stepping 9 | 8                8690    0.63 %  #28 0.41 %
> 32 *    GenuineIntel family 15 model 4 stepping 3 | 2   8179    0.59 %  #11 1.19 %
> 33 *    AuthenticAMD family 15 model 107 stepping 2 | 2 8130    0.59 %  #11 1.02 %
> 34 *    GenuineIntel family 6 model 22 stepping 1 | 1   7811    0.56 %  #9  1.13 %
> 35      GenuineIntel family 6 model 37 stepping 5 | 2   7714    0.56 %  (not in top 50)
> 36      GenuineIntel family 6 model 23 stepping 10 | 4  7568    0.55 %  #12 0.78 %
> 37 *    GenuineIntel family 15 model 4 stepping 1 | 2   6837    0.49 %  #8  1.23 %
> 38 *    GenuineIntel family 15 model 2 stepping 9 | 1   6692    0.48 %  #8  1.39 %
> 39 ***  AuthenticAMD family 22 model 0 stepping 1 | 2   6471    0.47 %  #5  4.42 %
> 40      family 6 model 70 stepping 1 | 8                6385    0.46 %  #39 0.33 %
> 41 *    GenuineIntel family 15 model 4 stepping 9 | 2   6176    0.45 %  #86 1.39 %
> 42      family 6 model 37 stepping 5 | 4                6064    0.44 %  #27 0.46 %
> 43 *    GenuineIntel family 15 model 4 stepping 1 | 1   5970    0.43 %  #7  1.57 %
> 44 ***  AuthenticAMD family 21 model 19 stepping 1 | 2  5894    0.43 %  #3  3.89 %
> 45      GenuineIntel family 6 model 28 stepping 10 | 2  5522    0.40 %  #23 0.42 %
> 46 *    AuthenticAMD family 18 model 1 stepping 0 | 2   5491    0.40 %  #5  1.60 %
> 47      GenuineIntel family 6 model 78 stepping 3 | 4   5489    0.40 %  (not in top 50)
> 48      family 6 model 42 stepping 7 | 8                5387    0.39 %  #27 0.48 %
> 49      GenuineIntel family 6 model 45 stepping 7 | 4   5368    0.39 %  (not in top 50)
> 50 *    AuthenticAMD family 21 model 16 stepping 1 | 2  5353    0.39 %  #9  1.20 %

Over all CPU families, EnterBaselineMethod crashes were 0.73% of all crashes.
Looking at individual CPU families, four of them stood out as having
EnterBaselineMethod crash rates in the range 3.79--4.50%. These are marked with
'***'. I also marked ones with an EnterBaselineMethod crash rate greater than
1% with '*', but those could just be natural variation.

I then searched for all EnterBaselineMethod crashes in Firefox in the past 7
days, faceted by Cpu Info:

https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Djs%3A%3Ajit%3A%3AEnterBaselineMethod&_facets=signature&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

I then cross-correlated these ranks with the ranks in the table above, giving this table:

>                                                                             rank in table 1
> 1       GenuineIntel family 6 model 23 stepping 10 | 2  1262    12.56 %     #1 
> 2       GenuineIntel family 6 model 15 stepping 13 | 2  738     7.34 %      #5
> 3       AuthenticAMD family 16 model 6 stepping 3 | 2   695     6.92 %      #15 ***
> 4       AuthenticAMD family 20 model 2 stepping 0 | 2   619     6.16 %      #21 ***
> 5       GenuineIntel family 6 model 42 stepping 7 | 4   404     4.02 %      #2
> 6       GenuineIntel family 6 model 58 stepping 9 | 4   292     2.91 %      #3
> 7       AuthenticAMD family 22 model 0 stepping 1 | 2   285     2.84 %      #39 ***
> 8       AuthenticAMD family 16 model 6 stepping 3 | 1   233     2.32 %      N/A ***
> 9       AuthenticAMD family 21 model 19 stepping 1 | 2  228     2.27 %      #44 ***
> 10      GenuineIntel family 6 model 42 stepping 7 | 2   220     2.19 %      #6
> 11      AuthenticAMD family 16 model 6 stepping 2 | 2   196     1.95 %      #24 *
> 12      GenuineIntel family 6 model 37 stepping 5 | 4   162     1.61 %      #8
> 13      GenuineIntel family 6 model 58 stepping 9 | 2   159     1.58 %      #9
> 14      GenuineIntel family 6 model 23 stepping 6 | 2   157     1.56 %      #14
> 15      GenuineIntel family 6 model 15 stepping 11 | 2  155     1.54 %      #22
> 16      AuthenticAMD family 20 model 1 stepping 0 | 2   143     1.42 %      N/A ***
> 17      GenuineIntel family 15 model 6 stepping 5 | 2   135     1.34 %      #27
> 18      GenuineIntel family 6 model 60 stepping 3 | 4   118     1.17 %      #4
> 19      AuthenticAMD family 6 model 8 stepping 1 | 1    103     1.02 %      N/A ***
> 20      GenuineIntel family 6 model 15 stepping 2 | 2   99      0.99 %      #26
> 21      GenuineIntel family 15 model 4 stepping 3 | 2   95      0.95 %      #32 *
> 22      GenuineIntel family 15 model 4 stepping 1 | 1   94      0.94 %      #43 *
> 23      GenuineIntel family 15 model 2 stepping 9 | 1   93      0.93 %      #38 *
> 24      AuthenticAMD family 18 model 1 stepping 0 | 2   90      0.90 %      #46 *
> 25      GenuineIntel family 6 model 22 stepping 1 | 1   89      0.89 %      #34 *
> 26      GenuineIntel family 15 model 4 stepping 9 | 2   85      0.85 %      #41 *
> 27      GenuineIntel family 15 model 4 stepping 1 | 2   84      0.84 %      #37 *
> 28      AuthenticAMD family 15 model 107 stepping 2 | 2 83      0.83 %      #33
> 29      AuthenticAMD family 22 model 0 stepping 1 | 4   78      0.78 %      N/A **
> 30      GenuineIntel family 6 model 15 stepping 6 | 2   76      0.76 %      #28
> 31      AuthenticAMD family 6 model 10 stepping 0 | 1   75      0.75 %      N/A **
> 32      GenuineIntel family 6 model 37 stepping 2 | 4   72      0.72 %      #18
> 33      GenuineIntel family 6 model 69 stepping 1 | 4   71      0.71 %      #7
> 34      GenuineIntel family 15 model 6 stepping 5 | 1   67      0.67 %      N/A *
> 35      AuthenticAMD family 21 model 16 stepping 1 | 2  64      0.64 %      #50 *
> 36      GenuineIntel family 6 model 42 stepping 7 | 8   64      0.64 %      #25
> 37      AuthenticAMD family 16 model 5 stepping 3 | 3   58      0.58 %      N/A *
> 38      AuthenticAMD family 21 model 16 stepping 1 | 4  58      0.58 %      N/A *
> 39      GenuineIntel family 6 model 23 stepping 10 | 4  57      0.57 %      #36
> 40      AuthenticAMD family 16 model 6 stepping 2 | 1   55      0.55 %      N/A *
> 41      AuthenticAMD family 21 model 48 stepping 1 | 4  55      0.55 %      N/A *
> 42      GenuineIntel family 15 model 4 stepping 9 | 1   55      0.55 %      N/A
> 43      GenuineIntel family 6 model 58 stepping 9 | 8   53      0.53 %      #19
> 44      GenuineIntel family 6 model 60 stepping 3 | 2   50      0.50 %      #20
> 45      AuthenticAMD family 18 model 1 stepping 0 | 4   49      0.49 %      N/A
> 46      GenuineIntel family 6 model 60 stepping 3 | 8   47      0.47 %      #13
> 47      AuthenticAMD family 16 model 4 stepping 3 | 4   46      0.46 %      N/A
> 48      AuthenticAMD family 15 model 75 stepping 2 | 2  43      0.43 %      N/A
> 49      AuthenticAMD family 16 model 5 stepping 3 | 4   43      0.43 %      N/A
> 50      GenuineIntel family 15 model 2 stepping 7 | 1   41      0.41 %      N/A

The relative position of a CPU family in the two tables indicates its crash rate.
For example, the first entry is unsurprising -- "GenuineIntel family 6 model 23
stepping 10" is the #1 family with an EnterBaselineMethod crash, but it's also
the #1 CPU overall.

But entries #3 and #4 in this table had much lower rankings in the first table,
which suggests they have unusually high EnterBaselineMethod crash rates, and
indeed they were two of the previously-identified suspicious ones.
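
The cross-correlation above amounts to dividing each family's EnterBaselineMethod count (second table) by its total crash count (first table). A rough sketch using a handful of rows copied from the two tables; the "3x the overall rate" cutoff is an assumption picked for illustration, not a threshold used in the analysis. Because the two searches were run separately, the computed rates need not match the percentages quoted in the first table exactly.

```python
# (cpu info, all crashes in 7 days, EnterBaselineMethod crashes in 7 days)
rows = [
    ("GenuineIntel family 6 model 23 stepping 10 | 2", 137092, 1262),
    ("AuthenticAMD family 16 model 6 stepping 3 | 2", 18341, 695),
    ("AuthenticAMD family 20 model 2 stepping 0 | 2", 13663, 619),
    ("AuthenticAMD family 22 model 0 stepping 1 | 2", 6471, 285),
    ("AuthenticAMD family 21 model 19 stepping 1 | 2", 5894, 228),
]

OVERALL_RATE = 0.73  # percent, "ALL FAMILIES COMBINED" in the first table
CUTOFF = 3 * OVERALL_RATE  # assumed cutoff, for illustration only


def rate_percent(total: int, ebm: int) -> float:
    """EnterBaselineMethod crashes as a percentage of all crashes."""
    return 100.0 * ebm / total


suspicious = [
    (cpu, round(rate_percent(total, ebm), 2))
    for cpu, total, ebm in rows
    if rate_percent(total, ebm) > CUTOFF
]
for cpu, rate in suspicious:
    print(f"{rate:5.2f} %  {cpu}")
```

With these rows, the Intel entry stays below the cutoff and the four AMD entries marked '***' in the first table are flagged.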

There are also some entries that show up reasonably high in this table, but
didn't show up at all in the previous table. So I looked them up and this gave
us a few more entries that could be added to the first table:

> ??      AuthenticAMD family 16 model 6 stepping 3 | 1   233     ?.?? %  #?  5.34 %
> ??      AuthenticAMD family 20 model 1 stepping 0 | 2   143     ?.?? %  #?  3.63 %
> ??      AuthenticAMD family 6 model 8 stepping 1 | 1    102     ?.?? %  #?  2.66 %
> ??      AuthenticAMD family 22 model 0 stepping 1 | 4   78      ?.?? %  #?  1.68 %

The first entry here, despite having a low number of crashes -- it must just be
an uncommon CPU family -- had an even higher EnterBaselineMethod crash rate of 5.34%.

This analysis isn't perfect because other crash signatures may also have
correlations against CPU family. Ideally we'd match the EnterBaselineMethod
crash rates for each CPU family against the CPU family usage among our user
population, perhaps from telemetry data.

Jan de Mooij [:jandem]

Updated

•

8 years ago

Depends on: amdbug

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

status-firefox34: affected → wontfix

status-firefox35: affected → wontfix

status-firefox39: affected → wontfix

status-firefox41: ? → wontfix

status-firefox42: ? → wontfix

status-firefox43: affected → wontfix

status-firefox44: ? → wontfix

status-firefox45: ? → wontfix

status-firefox46: affected → wontfix

status-firefox47: affected → wontfix

BugBot [:suhaib / :marco/ :calixte]

Comment 64

•

8 years ago

Crash volume for signature 'js::jit::EnterBaselineMethod':
 - nightly (version 50): 3 crashes from 2016-06-06.
 - aurora  (version 49): 6 crashes from 2016-06-07.
 - esr     (version 45): 1324 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          2          0          0          0          0          1
 - aurora           3          1          0          0          1          1          0
 - esr            197        161        157        142        176        155         89

Affected platforms: Windows, Mac OS X, Linux

status-firefox49: --- → affected

status-firefox50: --- → affected

status-firefox-esr45: --- → affected

Sylvestre Ledru [:Sylvestre]

Comment 65

•

8 years ago

Hmm, if this is the AMD bug, that means we had an ESR version being impacted...

Nicolas B. Pierron [:nbp]

Comment 66

•

8 years ago

Based on comment 44, I would expect us to still have a non-zero baseline of crashes, especially on old hardware.  I guess the likelihood of using a release / ESR version might be higher on old hardware.

Bug 1281759 only landed in Gecko 50, so this should not have changed aurora.

Could this be a problem with the crash reporter when we have no stack frame at the top?  Or maybe we discard these reports?  Or they are classified with a bunch of different signatures?

Marco Castelluccio [:marco]

Updated

•

8 years ago

Comment 67

•

8 years ago

Crash volume for signature 'js::jit::EnterBaselineMethod':
 - nightly (version 51): 1 crash from 2016-08-01.
 - aurora  (version 50): 1 crash from 2016-08-01.
 - beta    (version 49): 77 crashes from 2016-08-02.
 - release (version 48): 6134 crashes from 2016-07-25.
 - esr     (version 45): 1674 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        1       0       0
 - beta         28      22       9
 - release    1970    1820    1015
 - esr          68      47     121

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #730
 - aurora
 - beta    #589      #434
 - release #9        #4
 - esr     #156

status-firefox51: --- → affected

Marco Castelluccio [:marco]

Updated

•

8 years ago

Comment 68

•

8 years ago

Crash volume for signature 'js::jit::EnterBaselineMethod':
 - nightly (version 52): 2 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 41 crashes from 2016-09-20.
 - release (version 49): 94 crashes from 2016-09-05.
 - esr     (version 45): 1728 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       0       2
 - aurora        0       0
 - beta         34       7
 - release      67      27
 - esr         200     186

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly
 - aurora
 - beta    #440      #631
 - release #975      #474
 - esr     #57

status-firefox52: --- → affected

Nicolas B. Pierron [:nbp]

Updated

•

8 years ago

Priority: -- → P3

Julien Cristau [:jcristau] (back April 22)

Comment 69

•

7 years ago

Mass wontfix for bugs affecting firefox 52.

status-firefox52: affected → wontfix

Mike Taylor [:miketaylr]

Updated

•

7 years ago

Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] [@ js::jit::EnterBaselineMethod] → [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] [@ js::jit::EnterBaselineMethod] [@ EnterJit]

Jan de Mooij [:jandem]

Comment 71

•

6 years ago

Adding this to our crash triage list.

Assignee: jdemooij → nobody

Whiteboard: [native-crash] → [native-crash][#jsapi:crashes-retriage]

Ted Campbell [:tcampbell]

Comment 72

•

6 years ago

Closing in favor of meta-bug Bug 858032. Current investigations branch off there.

Blocks: SadJit

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → INCOMPLETE

Whiteboard: [native-crash][#jsapi:crashes-retriage] → [native-crash]

mirh

Comment 73

•

6 years ago

If this was due (at least partially) to bug 1281759 (I don't know whether it was; updated stats wouldn't hurt, I guess), then I'm not sure how smart it is to refer to a meta issue.

Ted Campbell [:tcampbell]

Comment 74

•

6 years ago

The signature encompasses a number of causes. There have also been numerous renames of JIT signatures, which has added to the confusion. The meta-bug should be referring to Bug 1281759 as one of the sources of crashes.

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox48: affected → wontfix

status-firefox49: affected → wontfix

status-firefox50: affected → wontfix

status-firefox51: affected → wontfix

status-firefox-esr45: affected → wontfix