crash in js::jit::EnterBaselineMethod(JSContext*, js::RunState&)

RESOLVED INCOMPLETE

Status

defect
P3
critical
RESOLVED INCOMPLETE
5 years ago
a year ago

People

(Reporter: ashughes, Unassigned)

Tracking

(Depends on 1 bug, Blocks 1 bug, {crash})

31 Branch

Firefox Tracking Flags

(e10s-, firefox30 unaffected, firefox31- wontfix, firefox32+ wontfix, firefox33+ wontfix, firefox34 wontfix, firefox35 wontfix, firefox39- wontfix, firefox41 wontfix, firefox42 wontfix, firefox43 wontfix, firefox44 wontfix, firefox45 wontfix, firefox46+ wontfix, firefox47+ wontfix, firefox48 affected, firefox49 affected, firefox-esr45 affected, firefox50 affected, firefox51 affected, firefox52 wontfix)

Details

(Whiteboard: [native-crash], crash signature)

This bug was filed from the Socorro interface and is 
report bp-23cc14e5-2e4f-4f96-ab95-2cf572140627.
=============================================================
0 	mozjs.dll 	js::jit::EnterBaselineMethod(JSContext *,js::RunState &) 	js/src/jit/BaselineJIT.cpp
1 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
2 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp
3 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
4 	mozjs.dll 	js_fun_apply(JSContext *,unsigned int,JS::Value *) 	js/src/jsfun.cpp
5 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
6 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
7 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp
8 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
9 	mozjs.dll 	js_fun_apply(JSContext *,unsigned int,JS::Value *) 	js/src/jsfun.cpp
10 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
11 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
12 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 	js/src/vm/Interpreter.cpp
13 	mozjs.dll 	js::Invoke(JSContext *,JS::CallArgs,js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
14 	mozjs.dll 	js::Invoke(JSContext *,JS::Value const &,JS::Value const &,unsigned int,JS::Value const *,JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
15 	mozjs.dll 	JS::Call(JSContext *,JS::Handle<JS::Value>,JS::Handle<JS::Value>,JS::HandleValueArray const &,JS::MutableHandle<JS::Value>) 	js/src/jsapi.cpp
16 	xul.dll 	mozilla::dom::EventListener::HandleEvent(JSContext *,JS::Handle<JS::Value>,mozilla::dom::Event &,mozilla::ErrorResult &) 	obj-firefox/dom/bindings/EventListenerBinding.cpp
17 	xul.dll 	mozilla::dom::EventListener::HandleEvent<mozilla::dom::EventTarget *>(mozilla::dom::EventTarget * const &,mozilla::dom::Event &,mozilla::ErrorResult &,mozilla::dom::CallbackObject::ExceptionHandling) 	obj-firefox/dist/include/mozilla/dom/EventListenerBinding.h
18 	xul.dll 	mozilla::EventListenerManager::HandleEventSubType(mozilla::EventListenerManager::Listener *,nsIDOMEvent *,mozilla::dom::EventTarget *) 	dom/events/EventListenerManager.cpp
19 	xul.dll 	mozilla::EventTargetChainItem::HandleEventTargetChain(nsTArray<mozilla::EventTargetChainItem> &,mozilla::EventChainPostVisitor &,mozilla::EventDispatchingCallback *,mozilla::ELMCreationDetector &) 	dom/events/EventDispatcher.cpp
20 	xul.dll 	mozilla::EventDispatcher::Dispatch(nsISupports *,nsPresContext *,mozilla::WidgetEvent *,nsIDOMEvent *,nsEventStatus *,mozilla::EventDispatchingCallback *,nsCOMArray<mozilla::dom::EventTarget> *) 	dom/events/EventDispatcher.cpp
=============================================================
More reports: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=js%3A%3Ajit%3A%3AEnterBaselineMethod%28JSContext%2A%2C+js%3A%3ARunState%26%29

This is the same signature as a recent B2G topcrasher (bug 978450) but affects Desktop Firefox. This signature has been around on Desktop for a long time but has recently exploded on Beta by an extreme margin starting on 2014-07-02.
https://crash-analysis.mozilla.com/rkaiser/2014-07-03/2014-07-03.firefox.31.explosiveness.html

This is currently #37 across 7 days and #22 across 3 days. While not strictly a "topcrash" yet, I'm marking it as such based on explosiveness.

Looking at the product correlation the volume is really high on the latest Beta compared to the previous Beta, and is really high on the latest Nightly compared to the latest Aurora.

> Firefox 31.0b6: 55.56%
> Firefox 33.0a1: 32.98%
> Firefox 32.0a2: 5.38%
> Firefox 31.0b5: 2.05%

Crashes per Install seems to indicate people are crashing here more than once:
> Firefox 31.0b6: 785 crashes per 622 installs
> Firefox 33.0a1: 466 crashes per 214 installs 
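
To make the "more than once" reading concrete, here is a minimal sketch that divides the crash counts above by the install counts. The helper name is hypothetical; only the four figures come from the report.

```python
# Figures copied from the correlation data above; the helper name is hypothetical.
def crashes_per_install(crashes: int, installs: int) -> float:
    """Average number of crash reports per affected install."""
    return crashes / installs

beta = crashes_per_install(785, 622)     # Firefox 31.0b6
nightly = crashes_per_install(466, 214)  # Firefox 33.0a1
print(f"31.0b6: {beta:.2f}, 33.0a1: {nightly:.2f}")
```

Both ratios are above 1 (roughly 1.26 and 2.18), which is consistent with at least some installs crashing repeatedly.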

Facebook seems to be the top URL in the correlations by far.
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-23cc14e5-2e4f-4f96-ab95-2cf572140627.
> =============================================================
> 0 	mozjs.dll 	js::jit::EnterBaselineMethod(JSContext *,js::RunState &) 
> js/src/jit/BaselineJIT.cpp
> 1 	mozjs.dll 	Interpret 	js/src/vm/Interpreter.cpp
> 2 	mozjs.dll 	js::RunScript(JSContext *,js::RunState &) 
> =============================================================

In general, such a stack (EnterBaselineMethod) is useless: we are entering generated code, so we do not know what code is being executed when these crashes happen.

> Looking at the product correlation the volume is really high on the latest
> Beta compared to the previous Beta, and is really high on the latest Nightly
> compared to the latest Aurora.
> 
> > Firefox 31.0b6: 55.56%
> > Firefox 33.0a1: 32.98%
> > Firefox 32.0a2: 5.38%
> > Firefox 31.0b5: 2.05%
> 

Changelog from Firefox 31.0b5 to Firefox 31.0b6:
http://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=a04918ac3197&tochange=9f7d43269809

Terrence, could that be related to Bug 1028358?
Anthony, can somebody from QA find a way to reproduce this issue?
Flags: needinfo?(terrence)
Flags: needinfo?(anthony.s.hughes)
(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> Anthony, can somebody from QA find a way to reproduce this issue?

There's really nothing useful in any of the reports to help guide testing. Is there anything in the pushlog which stands out that we could test around?
Flags: needinfo?(anthony.s.hughes)
(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> 
> Changelog from Firefox 31.0b5 to Firefox 31.0b6:
> http://hg.mozilla.org/releases/mozilla-beta/
> pushloghtml?fromchange=a04918ac3197&tochange=9f7d43269809
> 
> Terrence, could that be related to Bug 1028358?

I don't think so. That barrier code is not used by the jits, it would only increase the live set anyway, and the crash is a null deref, not a UAF. I don't think GC is likely to be implicated here.

> Anthony, can somebody from QA find a way to reproduce this issue?
Flags: needinfo?(terrence)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #2)
> (In reply to Nicolas B. Pierron [:nbp] from comment #1)
> > Anthony, can somebody from QA find a way to reproduce this issue?
> 
> There's really nothing useful in any of the reports to help guide testing.

No, reports with EnterBaseline are just saying “Hey we are executing some JavaScript that we have executed more than 10 times before”.

That does not help us find the context of the failure.

> Is there anything in the pushlog which stands out that we could test around?

I looked at it, and the only commit which stands out is bug 1028358, but Terrence replied to this hypothesis in comment 3.

The other option is that something new in Facebook pages (comment 0) is causing more failures by exposing an existing bug that may have been in the tree for a while.
Untracking. No activity, and too late for 31.
I'm marking as won't fix for 32 as there has been no activity. ni Naveed to help get this top crash prioritized.
Flags: needinfo?(nihsanullah)
Flags: needinfo?(nihsanullah)
Jan, so, how do the stats look? Thanks
Flags: needinfo?(jdemooij)
(In reply to Sylvestre Ledru [:sylvestre] from comment #8)
> Jan, so, how do the stats look like? Thanks

The fix mentioned in comment 7 helped a bit (and is in 32). But EnterBaselineMethod is still at #4 for 32, #10 for 33.

Unfortunately (top-)crashes in JIT code are not a new thing; we've had them since the first Firefox releases with a JIT. I looked at some of the crash reports recently and most of those were caused by memory corruption that's impossible to track down... It could even be code outside the JS engine that's misbehaving and corrupting our code.

I'll keep an eye on crash-stats though.
Flags: needinfo?(jdemooij)
OK. Thanks for the feedback.
I guess this is going to be a wontfix for 33.
Given comment 9, is there anything else that we can do in this bug?
Flags: needinfo?(jdemooij)
(In reply to Lawrence Mandel [:lmandel] from comment #12)
> Given comment 9, is there anything else that we can do in this bug?

If there's a new spike or a website that crashes reliably we'd be happy to investigate and fix it, but the current crashes look like random memory corruption and there's not much we can do.

This bug is not really actionable, so I don't know if we should track it.
Flags: needinfo?(jdemooij)
Kairo/Anthony - Is this still a topcrash in 33/34/35? If so, is there any more information that you can provide to assist with debugging? If not, this looks like a resolved/incomplete to me.
Flags: needinfo?(kairo)
Flags: needinfo?(anthony.s.hughes)
I looked over the stats for this signature and this does not seem to qualify as a topcrash anymore, though it is still affecting some users.

> 33.0*: 90 reports
> 34.0*: 28 reports
> 35.0*: 1 report
> 36.0*: 25 reports
https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=7&range_unit=days&date=2014-10-22&signature=js%3A%3Ajit%3A%3AEnterBaselineMethod%28JSContext*%2C+js%3A%3ARunState%26%29
Flags: needinfo?(kairo)
Flags: needinfo?(anthony.s.hughes)
Keywords: topcrash-win
Given the data in comment 15 and the lack of additional information for debugging, I think this can likely be resolved. I want to wait until at least tomorrow to give Kairo a chance to comment.

Comment 17

5 years ago
Well, if we resolve it, we might need another bug to track the ongoing (but probably unactionable) volume of crashes we always have with this signature, which in reality is probably all kinds of different things crashing *inside* Baseline-compiled code.
I'm going to leave this open so that we have somewhere to track (per Kairo in comment 17) but am dropping tracking as this is currently not actionable.
This signature now affects Developer Edition 39.0a2 2015-03-30 win32 builds under Windows at start-up. The builds are unusable. 

Win64 builds are not affected under Windows. 
Linux and Mac builds can be started and used.
Naveed, seems like we need your help! Could you help us with that? Thanks (this is critical as we cannot reenable 39 aurora updates).
Flags: needinfo?(nihsanullah)
Ugh, we have the same issue in automation at the moment in bug 1149377. I'm working on bisecting it now, but being pgo-only isn't helping.
See Also: → 1149377
FWIW, this bug clearly pre-dates whatever's going on with Aurora since yesterday's uplift. I think we should track the new problem over in bug 1149377 rather than this one.
Stopped tracking this one; tracking bug 1149377 instead.

Comment 24

4 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #22)
> FWIW, this bug clearly pre-dates whatever's going on with Aurora since
> yesterday's uplift. I think we should track the new problem over in bug
> 1149377 rather than this one.

Yes, the signature in here is pretty much a catch-all for a class of crashes in the Baseline JIT.
nbp and jandem are working on the current issue in bug 1149377. 

For the next time we end up here: This stack by itself (and therefore this specific bug) is not really actionable. It may imply a code generation problem or that an exception occurred while processing warm JS. Bisection or another hint will probably be needed to work the issue, and a more specific bug should be opened.
Flags: needinfo?(nihsanullah)

Comment 26

4 years ago
(In reply to Naveed Ihsanullah [:naveed] from comment #25)
> nbp and jandem are working on the current issue in bug 1149377. 
> 
> For the next time we end up here: This stack by itself (and therefore this
> specific bug) is not really actionable. It may imply a code generation
> problem or an exception occurred while processing warm JS. Bisection or
> another hint will probably be needed to work the issue and a more specific
> bug should be opened.

¡Hola Naveed!

FWIW I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=1200685

Hope it is useful else let me know and I'd close it =)

¡Gracias!
Flags: needinfo?(nihsanullah)
I'll pass the bug on to Jan. I don't see any additional actionable information in that bug, but perhaps Jan can tell more. 

Jan, can we instrument the code for this class of crashes so more information is available to us in the crash reports?
Flags: needinfo?(nihsanullah)
Flags: needinfo?(jdemooij)

Updated

4 years ago
Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] → [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] [@ js::jit::EnterBaselineMethod]
Duplicate of this bug: 1200685
"Assignee:" taken over from Bug 1200685.
Assignee: nobody → jdemooij
Blocks: shutdownkill
Whiteboard: ShutDownKill
Duplicate of this bug: 956980
From Bug 956980 ...

Summary: crash in js::jit::EnterBaselineMethod(JSContext*, js::RunState&) mostly with cached documents

https://bugzilla.mozilla.org/show_bug.cgi?id=956980#c0
(In reply to Kevin Brosnan [:kbrosnan] from comment #0)
> This bug was filed from the Socorro interface and is 
> report bp-77126f40-348c-46eb-9f74-79c772140106.
> =============================================================
> 
> Nothing useful in comments. Almost all the URLs have wyciwyg which suggests
> the documents were retrieved from the cache. Wired URLs represent 10 out of
> the 13 submitted URLs.
> 
> wyciwyg://0/http://www.wired.com/opinion/2013/11/so-the-internets-about-to-
> lose-its-net-neutrality/
> 
> wyciwyg://0/http://www.wired.com/opinion/2012/11/cease-and-desist-manuals-
> planned-obsolescence/
> 
> There are two non-cache URLs and those are
> 
> http://www.photoprikol.net/photo/138-igrushki-sssr-72-foto.html
> 
> https://www.facebook.com/
Whiteboard: ShutDownKill → [native-crash], ShutDownKill
+ Emails from the dups ...
(In reply to Naveed Ihsanullah [:naveed] from comment #27)
> Ill pass the bug on to Jan. I don't see any additional actionable
> information in that bug but perhaps Jan can tell more. 
> 
> Jan can we instrument the code for these class of crashes so more
> information is available to us in the crash reports?

Yeah these crashes aren't really actionable. JIT crashes are caused by different bugs and many of the reports are random memory corruption. We want to hear about spikes and reproducible cases though.

Making JIT code non-writable may help us catch memory corruption bugs sooner/elsewhere. That's bug 1215479 but it's pretty hard to do without regressing performance.
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #33)
> Making JIT code non-writable may help us catch memory corruption bugs
> sooner/elsewhere. That's bug 1215479 but it's pretty hard to do without
> regressing performance.

Could we only re-protect the code 1/10th of the time?  That would amortize the cost of protecting the pages and potentially catch some of these other issues without huge performance regressions, while providing a better crash stack.
From the crash signature [@ js::jit::EnterBaselineMethod ], the affected versions are:
- Nightly: 47
- Aurora: 46, 45
- Beta: 45.0b1, 45.0b2, 44.0b99, 44.0b1, 44.0b9, 44.0b8, 44.0b6, 44.0b2, 44.0b7

In the crash signature [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&) ] there are no reports in the last 28 days.
Currently, for the past 7 days, there are 2800 crashes reported for beta and only 12 reported on nightly for [@ js::jit::EnterBaselineMethod]

Comment 37

3 years ago
important
(In reply to [:tracy] Tracy Walker from comment #36)
> Currently, for the past 7 days, there are 2800 crashes reported for beta and
> only 12 reported on nightly for [@ js::jit::EnterBaselineMethod]

As mentioned throughout this bug, this signature is not actionable.
To investigate such issues, here are some of the fastest ways forward:
 - Reproduce the issue with one of the reported URLs.
 - List all backported patches since the last version. (comment 25)
 - Find an existing actionable bug which shows the same crash characteristics (crash address, stack pointer, …).

Without any of this information, I would not expect any investigation from the JS team, as we are likely to harm our users more with random urgent fixes.
This is a generic crash that doesn't appear to afflict e10s more or less than non-e10s. Untracking.

45.0b6 content process crashes - 228
45.0b6 crashes with e10s disabled - 2165
The percentage of beta users running e10s during our experiment was about 10%.
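
As a rough check of the "not more or less" claim, this sketch compares the e10s share of crashes with the e10s share of users. It assumes content process crashes stand in for all e10s crashes with this signature and treats "about 10%" as exactly 0.10; both are simplifications, not something stated in the data.

```python
content_process_crashes = 228  # 45.0b6 content process crashes (from above)
non_e10s_crashes = 2165        # 45.0b6 crashes with e10s disabled (from above)
e10s_user_share = 0.10         # "about 10%" of beta users, taken as exact here

# Share of all counted crashes that came from e10s content processes
observed_share = content_process_crashes / (content_process_crashes + non_e10s_crashes)
print(f"e10s share of crashes: {observed_share:.1%}, e10s share of users: {e10s_user_share:.0%}")
```

The crash share comes out at roughly 9.5%, close to the user share, which matches the conclusion that e10s is neither over- nor under-represented in this sample.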
No longer blocks: e10s-crashes
Looking at beta 46 (5, 6, 7) experiment crash data, this shows up twice as often under e10s. It is also the #8 top crasher.
Blocks: e10s-crashes
(In reply to Jim Mathies [:jimm] from comment #40)
> Jan, any suggestions here on how to proceed with this under e10s?

I looked at some of these beta crash dumps and it's the usual mix. Most common are:

* Valid JIT code but some invalid bytes in the middle. This is pretty weird and code corruption should be less likely now with W^X (also in 46). I'm currently tracking down a related crash in bug 1124397. That's probably some other thread misbehaving. I'll continue working on that one.

* Valid JIT code but reading/writing invalid memory. JIT code accesses a lot of things and this is probably similar to the GC topcrashes we have.

* Some crashes remind me of bug 1260721. I'll see what we can do there.

Unfortunately most of these look like random memory corruption. If these crashes are worse with e10s, maybe we have some heap corruption bugs there?
Flags: needinfo?(jdemooij)
Still a current problem. Firefox crashes less than 2 minutes after opening. Open a new tab, open a browser: many times, when the page attempts to render, it crashes. About 5 different crash reasons. It says this is not a plugin crash.
#2 topcrash on 46 release right now (pretty high volume, just under OOM crashes). e10s should be disabled on release. People are complaining that they are hitting the crash after updating. The crash spike may also be correlated with AV software (see bug 1268025)
Flags: needinfo?(jdemooij)
Today I looked at about 80 crash dumps for EnterBaselineMethod crashes (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.

Here are the largest buckets:

-----

(1) At least 15-20% of these crashes are with our notorious "AuthenticAMD family 20 model 2 stepping 0" CPU. These crashes are all similar: we're executing the following Baseline type monitor stub:

  cmp    $0xffffff88,%ecx
  jne    L
  cmp    %edx,0x10(%edi)
  jne    L
  ret    
 L:
  mov    0x4(%edi),%edi
  jmp    *(%edi)

The first instruction is the one where we crash (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a low address like 0x168). Yes, that makes no sense: this compare instruction does not access any memory.

I don't see crashes in this code with any other CPU. It's not the first time this processor is causing trouble, see bug 772330 and also bug 1264188 (although the latter is mostly model 1 and this is model 2). I wonder if this could be erratum 688 or a similar bug - Baseline stubs definitely use a lot of indirect jumps and calls.

Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501

Not sure what we should do here - we could try to emit some NOPS between the jumps and see if that helps...

-----

(2) At least 8% (7 reports) are caused by a single bit flip in ICEntry pointers in Baseline code. Baseline code calls into ICs for most bytecode ops, so a typical Baseline script has sequences of:

 mov    $0x6675cbcc,%edi <- ICEntry 1
 mov    (%edi),%edi
 call   *(%edi)
 ..
 mov    $0x6675cbd8,%edi <- ICEntry 2
 mov    (%edi),%edi
 call   *(%edi)
 ..
 mov    $0x6675cae4,%edi <- ICEntry 3
 mov    (%edi),%edi
 call   *(%edi)          <== crash

Notice that there are 12 bytes (that's sizeof(ICEntry) on x86) between ICEntry 1 and ICEntry 2. ICEntry 3 is bogus: it should be 0x6675cbe4 but it is 0x6675cae4 -- 1 bit was flipped.

These bit flips in ICEntry pointers are surprisingly common. We should probably add checks for this. Not sure what else we can do.

(This particular crash is bp-4a6a05ac-f0b7-4f75-b41f-50fbf2160501.)
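
The kind of check suggested above can be sketched as follows. This is only an illustration on the three pointers from the listing: it assumes the ICEntry structures are laid out contiguously, 12 bytes apart, in the order the code references them, and the function name is hypothetical rather than anything in the JS engine.

```python
SIZEOF_ICENTRY = 12  # bytes on x86, as stated above

def find_single_bit_flips(pointers, stride=SIZEOF_ICENTRY):
    """Yield (index, observed, expected, bit) for each pointer that differs in
    exactly one bit from the address predicted by the first pointer and the stride."""
    base = pointers[0]
    for i, observed in enumerate(pointers):
        expected = base + i * stride
        diff = observed ^ expected
        # A nonzero value with exactly one bit set is a power of two
        if diff and diff & (diff - 1) == 0:
            yield i, observed, expected, diff.bit_length() - 1

# ICEntry pointers loaded into %edi in the listing above
entries = [0x6675cbcc, 0x6675cbd8, 0x6675cae4]
for index, observed, expected, bit in find_single_bit_flips(entries):
    print(f"ICEntry {index + 1}: {observed:#x} should be {expected:#x} (bit {bit} flipped)")
```

On this input only the third pointer is flagged: the XOR with the expected 0x6675cbe4 is 0x100, a single set bit.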

-----

(3) At least 15% (13 reports) are bit flips in JIT code (either instructions or labels), for instance:

- Exhibit 1: bp-2639a76f-172c-47d2-81b4-a01162160501

 cmp    $0x1000000,%ebx
 jb     0x11a7f4e2
 cmp    $0xffffff88,%ecx
 jne    0x11a7f4d2

This is part of a post barrier in JIT code. The second jump offset should be the same as the first jump, but a bit was flipped so instead it jumps in the middle of an instruction.

(At 0x11a7f4d2 we have a 0xfb byte, that's an STI instruction that's invalid in user mode, so we crash with EXCEPTION_PRIV_INSTRUCTION.)

- Exhibit 2: bp-6a2f6ba3-7ac7-4737-a4c9-d21542160503

 1e91016:	bf e0 20 85 0a       	mov    $0xa8520e0,%edi
 1e9101b:	8b 3f                	mov    (%edi),%edi
 1e9101d:	ff 17                	call   *(%edi)

 1e9101f:	bf ec 20 85 0a       	mov    $0xa8520ec,%edi
 1e91024:	8b 3f                	mov    (%edi),%edi
 1e91026:	ff 1f                	lcall  *(%edi)

The last instruction is where we crash: a bitflip (0x17 -> 0x1f) corrupted a call instruction ("lcall" makes no sense).

There are many similar bitflips.
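
To see why one flipped bit turns "call" into "lcall" in Exhibit 2, the second byte of each instruction can be decoded as an x86 ModRM byte. For opcode 0xFF the reg field selects the operation: /2 is a near indirect call and /3 is a far indirect call. The sketch below only decodes the two bytes from the dump; the helper name is made up for illustration.

```python
def decode_modrm(byte: int):
    """Split an x86 ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

# For opcode 0xFF, the reg field of ModRM selects the operation (only two shown)
FF_GROUP = {2: "call, near indirect", 3: "call, far indirect (lcall)"}

good, bad = 0x17, 0x1F        # second byte of the intact and the corrupted call
flipped = good ^ bad          # bits that differ between the two encodings
for modrm in (good, bad):
    mod, reg, rm = decode_modrm(modrm)
    print(f"ff {modrm:02x}: mod={mod} reg=/{reg} rm={rm} -> {FF_GROUP[reg]}")
print(f"differing bits: {flipped:#04x}")
```

The two bytes differ only in bit 3 (0x08), which is the lowest bit of the reg field, so the operand (%edi) stays the same while the operation changes from /2 to /3.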

-----

(4) At least 14% (12 reports) are EXCEPTION_ACCESS_VIOLATION_EXEC while trying to execute memory that doesn't look like JIT code. Likely random pages that we attempt to execute because of a bug somewhere.

Many of these are probably caused by bit flips (the previous 2 categories) and we happened to end up in mapped memory instead of crashing immediately.

-----

These 4 buckets cover about 50% or so. The remaining crashes are harder to categorize, but I think a good chunk of them are caused by similar memory corruption.

I did see some crashes where we have for instance a Value with object type tag and nullptr payload, but because there are so few of them it's not clear what's going on.
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #44)
> Today I looked at about 80 crash dumps for EnterBaselineMethod crashes
> (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.
> 
> Here are the largest buckets:
> 
> -----
> 
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
> 
>   cmp    $0xffffff88,%ecx
>   jne    L
>   cmp    %edx,0x10(%edi)
>   jne    L
>   ret    
>  L:
>   mov 0x4(%edi),%edi
>   jmp    *(%edi)
> 
> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.
> 
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.
> 
> Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501
> 
> Not sure what we should do here - we could try to emit some NOPS between the
> jumps and see if that helps...

This sounds kind of similar to bug 772330 comment 22, where dmajor describes an AMD CPU erratum.  The erratum is in this doc:

http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The description says:

"Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the branch status when a taken branch occurs where the first or second instruction after the branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that it appears the processor skips, and does not execute, one or more instructions. The new updated rIP due to this erratum may not be at an instruction boundary"

It's not a great matchup, but if this sort of crash is *all* on AMD chips, the above might be plausible...
...and if I had read more closely, I would have seen that you referenced that exact bug and erratum. =/
Random idea:

What if we had a system that allocated a few scattered MiB (i.e., not all in one contiguous run or always at the same address, though being careful not to unduly increase fragmentation) with a predictable bit-pattern and periodically (say, on the daily telemetry or update ping) the system scanned all those MiBs to ensure they still had the same bit-pattern.  If a bitflip was detected, we set a flag in the browser that gets included in crash reports and also persists between browser restarts (at least for a period of time).

This could help us confirm a correlation between these catch-all JIT/GC crashes and the corruption flag and also have separate bins so that spikes in non-corruption-correlated crashes get more attention.  If we wanted to get fancy, we could even pop up a notification to the user suggesting they have bad RAM if they had the corruption flag and they were experiencing crashes :)
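
A toy model of the idea, to make the moving parts concrete. It is a sketch under several assumptions: the fill byte, region count, and class name are arbitrary, Python gives no control over where the buffers are placed, and a real version would live in the browser's native code and persist the flag across restarts.

```python
PATTERN = 0xA5  # arbitrary, predictable fill byte (an assumption)

class CorruptionCanary:
    """Hold several separately allocated buffers filled with a known pattern."""

    def __init__(self, regions: int = 4, region_size: int = 1 << 20):
        self.regions = [bytearray([PATTERN]) * region_size for _ in range(regions)]
        self.corruption_seen = False  # stand-in for the flag added to crash reports

    def scan(self):
        """Return (region, offset, flipped_bits) for every byte that changed."""
        hits = []
        expected = bytes([PATTERN])
        for r, buf in enumerate(self.regions):
            # Fast path: only walk the buffer byte by byte if something changed
            if buf.count(expected) == len(buf):
                continue
            for offset, value in enumerate(buf):
                if value != PATTERN:
                    hits.append((r, offset, value ^ PATTERN))
        if hits:
            self.corruption_seen = True
        return hits

# Small regions here so the example runs instantly
canary = CorruptionCanary(regions=2, region_size=4096)
canary.regions[1][100] ^= 0x04   # simulate a single bit flip
print(canary.scan(), canary.corruption_seen)
```

The periodic scan would run from whatever timer already drives the daily ping; the returned bit masks would also show whether the damage is a single-bit flip or something larger.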
This isn't a shutdown crash as far as I can tell.
No longer blocks: shutdownkill
Whiteboard: [native-crash], ShutDownKill → [native-crash]
Depends on: 772330
(In reply to Jan de Mooij [:jandem] from comment #44)
> Today I looked at about 80 crash dumps for EnterBaselineMethod crashes
> (Firefox 46.0, date >= 2016-05-01, uptime > 5000) and tried to group them.

Awesome work!

> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.

Can we detect this CPU family and use the segfault handler to resume execution?  This would be similar to the way operating systems emulate old instructions on newer generations of CPUs.

> […] are caused by a single bit flip […]

Luke's suggestion sounds interesting.  I recall people mentioning doing a memcheck as part of the safe-mode.

I do not know what cost this would have, but maybe this is something we can (randomly) do when we allocate new memory pages.
(In reply to Jan de Mooij [:jandem] from comment #44)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
...
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.

My experience with the AMD crashes has been that some of them show up mostly on model 1, and others show up mostly on model 2.

> Not sure what we should do here - we could try to emit some NOPS between the
> jumps and see if that helps...

That could actually be very interesting, especially if you understand what the alignment conditions described in the erratum are referring to.  (alignment of what?)  It seems like it might be entirely possible to make these crashes go away for JIT-generated code by changing the alignment of jumps.

> These bit flips in ICEntry pointers are surprisingly common. We should
> probably add checks for this. Not sure what else we can do.

Yeah, I've seen a bunch of other crashes recently that were the result of bit flips in memory.  Though I'm curious if it's consistent with random memory being bad for such a high proportion of such crashes to end up at this particular spot.  Are there a large number of these pointers?   (For the bit flips being in JIT code... it seems more obvious to me that that's a big use of memory.)
Thank you for the detailed analysis, Jan.

The bitflips are scary. Jan, are you assuming that it's faulty hardware that's the cause? It sounds like others are assuming that but I can't tell if that's what you think.

I think running a memtest under certain circumstances is a great idea. How hard is it to write a memtest? What circumstances would you run it under? Do we have a bug open for this idea?
I remember dolske was experimenting with running a memtest a few years ago; I don't know what happened there.  I don't have any bugs on file -- just an idea while reading Jan's very interesting analysis -- sorry, I don't mean to derail the more targeted discussion here.
The mysterious category 1 could also be bit flips.  This is the machine code for the troublesome fragment:

   0:	83 f9 88             	cmp    $0xffffff88,%ecx
   3:	75 06                	jne    b <L>
   5:	39 57 10             	cmp    %edx,0x10(%edi)
   8:	75 01                	jne    b <L>
   a:	c3                   	ret    
<L>

I generated all possible one-bit flips of the first instruction.  One possibility stands out: if the first byte becomes A3, then the CPU sees

   0:	a3 f9 88 75 06       	mov    %eax,0x67588f9
   5:	39 57 10             	cmp    %edx,0x10(%edi)
   8:	75 01                	jne    b <L>
   a:	c3                   	ret    

which performs a write to memory at an address that's almost certainly inaccessible.  It's not a _low_ address, though.  It takes three bit flips to hit an instruction that could plausibly write to a low address:

   0:	89 79 88             	mov    %edi,-0x78(%ecx)
   3:	75 06                	jne    b <L>
   5:	39 57 10             	cmp    %edx,0x10(%edi)
   8:	75 01                	jne    b <L>
   a:	c3                   	ret    

Still, what with all the other cases seeming to be memory corruption, I would suggest that this is more probable than a CPU bug.
Another one-bit flip possibility that I missed earlier:

   0:	83 b9 88 75 06 39 57 	cmpl   $0x57,0x39067588(%ecx)
   7:	10 75 01             	adc    %dh,0x1(%ebp)
   a:	c3                   	ret    

That could hit a low address depending on what's in %ecx.  (What _is_ in %ecx?)
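
The enumeration described in the last two comments can be reproduced with a few lines. This sketch only generates the candidate byte sequences and counts differing bits; deciding what each candidate decodes to still needs a disassembler (not shown), and the function names are illustrative.

```python
def single_bit_flips(code: bytes):
    """Yield every variant of `code` that differs from it in exactly one bit."""
    for i in range(len(code)):
        for bit in range(8):
            variant = bytearray(code)
            variant[i] ^= 1 << bit
            yield bytes(variant)

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = bytes.fromhex("83f988")            # cmp $0xffffff88,%ecx
variants = set(single_bit_flips(original))

print(len(variants))                          # candidates to feed to a disassembler
print(bytes.fromhex("a3f988") in variants)    # the "mov %eax,<absolute address>" case
print(bytes.fromhex("83b988") in variants)    # the "cmpl $imm,disp(%ecx)" case
print(hamming_distance(original, bytes.fromhex("897988")))  # the "mov %edi,-0x78(%ecx)" case
```

The three-byte instruction has 24 one-bit variants; both the A3 and the B9 cases are among them, while the `89 79 88` encoding is three bits away, in line with the counts given above.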
See Also: → 1270554
(In reply to Nicholas Nethercote [:njn] from comment #51)
> Thank you for the detailed analysis, Jan.
> 
> The bitflips are scary. Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.
> 
> I think running a memtest on certain circumstances is a great idea. How hard
> is it write a memtest? What circumstances would you run it under? Do we have
> a bug open for this idea?

Bug 995652 is our memtest-on-crash bug. I've filed bug 1270554 to work on a memtest in the running Firefox process.
See Also: → 995652
Thanks for all the comments. Replies below.

(In reply to Nicolas B. Pierron [:nbp] from comment #49)
> Can we detect this cpu familly, and use the segfault handler to resume the
> execution?  In a similar way as operating system are emulating old
> instructions on newer generations of cpu.

Interesting idea but it seems complicated, also because we don't really know the state of the CPU when it misbehaves.

(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) from comment #50)
> > Not sure what we should do here - we could try to emit some NOPS between the
> > jumps and see if that helps...
> 
> That could actually be very interesting, especially if you understand what
> the alignment conditions described in the erratum are referring to. 
> (alignment of what?)  It seems like it might be entirely possible to make
> these crashes go away for JIT-generated code by changing the alignment of
> jumps.

Yeah I think as a first step we could try to emit NOPS as part of this particular IC stub, and see if it makes these crashes go away.

> Though I'm curious if it's consistent with random
> memory being bad for such a high proportion of such crashes to end up at
> this particular spot.  Are there a large number of these pointers?

Yes, basically one for each interesting JS bytecode op. Also, we first emit the code and then at the end we write these pointers in it (once we know the values), so it's possible that write pattern happens to hit memory or cache lines in a way that makes it more error prone.

(In reply to Nicholas Nethercote [:njn] from comment #51)
> Jan, are you assuming that it's faulty hardware
> that's the cause? It sounds like others are assuming that but I can't tell
> if that's what you think.

I think so, yeah. In theory it could be another thread doing something like *bytePtr ^= 0x1, but that also seems unlikely. Our JIT code is usually non-writable so the window for this is pretty small. Also, on Twitter people from the Chrome/V8 teams said they've seen similar bitflips.

(In reply to Zack Weinberg (:zwol) from comment #53)
> The mysterious category 1 could also be bit flips.

That's not what I'm seeing in the memory dumps. Or do you mean a different kind of bitflip, somewhere in the CPU?

> Still, what with all the other cases seeming to be memory corruption, I
> would suggest that this is more probable than a CPU bug.

Also if it's (a) *only* this exact CPU, and (b) *always* this particular piece of JIT code and (c) this CPU is *known* to be buggy when it comes to (indirect) branches?

Comment 57

3 years ago
It might be easy and interesting to compute a checksum of each block of machine code, and then check it before entry. Checksums can be pretty fast.

Comment 58

3 years ago
... *especially* checksums that need to detect only a single bit changing, without correction. We could do giant SSE xors, 128 bits at a time, over the code. It'd be on the scale of a memset or memcpy operation.
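
A rough sketch of the XOR idea, in Python rather than SSE, only to show the shape of the check. The function name `xor_checksum` and the emit-time/entry-time split are assumptions for illustration, not existing SpiderMonkey code; the stub bytes are the short-jump encoding from the disassembly earlier in this bug.

```python
def xor_checksum(code: bytes, width: int = 16) -> int:
    """XOR-fold `code` in `width`-byte words (16 bytes = 128 bits)."""
    padded = code + bytes(-len(code) % width)  # zero-pad to a whole word
    acc = 0
    for off in range(0, len(padded), width):
        acc ^= int.from_bytes(padded[off:off + width], "little")
    return acc


# Type monitor stub: cmp / jne / cmp / jne / ret.
stub = bytes.fromhex("83 f9 88 75 06 39 57 10 75 01 c3")

# Computed once, right after the code is emitted.
expected = xor_checksum(stub)

# Checked again before entry; a single flipped bit changes the result.
corrupted = bytearray(stub)
corrupted[1] ^= 0x40  # f9 -> b9
print(xor_checksum(stub) == expected)
print(xor_checksum(bytes(corrupted)) == expected)
```

A plain XOR fold catches any single flipped bit, but two flips at the same bit position in different 16-byte words cancel each other out, so it is a cheap detector rather than a strong one.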
(In reply to Jan de Mooij [:jandem] from comment #44)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU. These crashes are all similar: we're
> executing the following Baseline type monitor stub:
> 
>   cmp    $0xffffff88,%ecx
>   jne    L
>   cmp    %edx,0x10(%edi)
>   jne    L
>   ret    
>  L:
>   mov 0x4(%edi),%edi
>   jmp    *(%edi)
> 
> The first instruction is the one where we crash
> (EXCEPTION_ACCESS_VIOLATION_READ or  EXCEPTION_ACCESS_VIOLATION_WRITE with a
> low address like 0x168). Yes, that makes no sense: this compare instruction
> does not access any memory.

FWIW, this hasn't been a characteristic of the other crashes we've seen with bug 772330.  I believe for those, it's made sense how we would have crashed at the given instruction given that we ended up there in the state we were in.

> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.
> 
> Example crash: bp-c70d9601-a96e-442b-ac05-d0ab52160501

I looked at this one a little bit, and I really don't see how we crashed.  The JIT code was:

06EF5C50 83 F9 88             cmp         ecx,0FFFFFF88h  
06EF5C53 0F 85 0A 00 00 00    jne         06EF5C63  
06EF5C59 39 57 10             cmp         dword ptr [edi+10h],edx  
06EF5C5C 0F 85 01 00 00 00    jne         06EF5C63  
06EF5C62 C3                   ret

I wonder if there's a way to transform that into something that reads from EBX with a bit flip.  (I mention EBX because the crash address is 0x80, which is the value of EBX.)  (The closest I see is 8B 3B, which is four bit flips!)
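
The flip counts quoted in this bug can be double-checked by counting differing bits between two encodings. This is just a sketch; `bit_flips` is a made-up helper name.

```python
def bit_flips(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length encodings differ."""
    if len(a) != len(b):
        raise ValueError("encodings must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = bytes.fromhex("83 f9")   # first two bytes of cmp ecx,0FFFFFF88h
candidate = bytes.fromhex("8b 3b")  # mov edi,dword ptr [ebx]

print(bit_flips(original, candidate))
```

The same helper gives three flips for the 89 79 88 candidate and one for the 83 b9 88 candidate mentioned earlier.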
(In reply to David Baron [:dbaron] ⌚️UTC-7 (review requests must explain patch) (busy May 9-13) from comment #59)
> I wonder if there's a way to transform that into something that reads from
> EBX with a bit flip.  (I mention EBX because the crash address is 0x80,
> which is the value of EBX.)

I looked at some other reports and the crash address is often the value in either EAX or EBX. Another erratum that might be relevant here:

> 578. Branch Prediction May Cause Incorrect Processor Behavior
> 
> Under a highly specific and detailed set of internal timing conditions involving
> multiple events occurring within a small window of time, the processor branch
> prediction logic may cause the processor core to decode incorrect instruction
> bytes.
> 
> Potential Effect on System
> 
> Unpredictable program behavior, generally leading to a program exception.

I think "decode incorrect instruction bytes" fits these crashes really well. This issue has been fixed, and it's possible to check CPUID bits to see if the processor has the fix.

Unfortunately there's not much information to go on so it's just guessing at this point.
For [@ js::jit::EnterBaselineMethod ], it's the #10 topcrash on release for 46.0.1.  It looks fairly high volume on 47 beta 4 and beta as well. It doesn't show up much in 49 or 48. Is there something new going on? Does this seem actionable at all?   

Till we know, I'll track this for 46 and 47.
Flags: needinfo?(jdemooij)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #61)
> For [@ js::jit::EnterBaselineMethod ], it's the #10 topcrash on release for
> 46.0.1.  It looks fairly high volume on 47 beta 4 and beta as well. It
> doesn't show up much in 49 or 48. Is there something new going on? Does this
> seem actionable at all?   
> 
> Till we know, I'll track this for 46 and 47.

I looked at this a bit and I don't think the beta crashes are very different from release.

There's at least 1 user (on XP SP2) who submitted a pretty large number of beta crash reports. They don't look very actionable though, maybe malware or bad hardware.
Flags: needinfo?(jdemooij)
> (1) At least 15-20% of these crashes are with our notorious "AuthenticAMD
> family 20 model 2 stepping 0" CPU.
> [...]
> I don't see crashes in this code with any other CPU. It's not the first time
> this processor is causing trouble, see bug 772330 and also bug 1264188
> (although the latter is mostly model 1 and this is model 2). I wonder if
> this could be erratum 688 or a similar bug - Baseline stubs definitely use a
> lot of indirect jumps and calls.

I did some follow-up analysis on this.

TL;DR: The following CPU families have suspiciously high EnterBaselineMethod
crash rates. They are ranked from the most crashes to the least. 

> Cpu Info                                        Count
> AuthenticAMD family 16 model 6 stepping 3 | 2   18341
> AuthenticAMD family 20 model 2 stepping 0 | 2   13663
> AuthenticAMD family 22 model 0 stepping 1 | 2   6471
> AuthenticAMD family 21 model 19 stepping 1 | 2  5894
> AuthenticAMD family 16 model 6 stepping 3 | 1   233
> AuthenticAMD family 20 model 1 stepping 0 | 2   143
> AuthenticAMD family 6 model 8 stepping 1 | 1    102
> AuthenticAMD family 22 model 0 stepping 1 | 4   78

Note especially the many crashes outside of "family 20"! Perhaps AMD bugs are
more widespread than we thought? Jan, it might be worth looking at JIT crashes
in these other families.

----

Details:

I did a super search for all Firefox crashes in the past 7 days, faceted on the
"cpu info" field:

https://crash-stats.mozilla.com/search/?product=Firefox&_facets=signature&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

I clicked on all 50 entries, opening a new tab for each one. Each tab thus held
all the Firefox crashes in the past 7 days for a single CPU family. I then went
through them all and manually extracted the rank and percentage for
EnterBaselineMethod crashes, resulting in the following table.

> Rank    Cpu info                                        Count   %       EnterBaselineMethod rank/%
> -       ALL FAMILIES COMBINED                                           #14 0.73 %
> 1       GenuineIntel family 6 model 23 stepping 10 | 2  137092  9.91 %  #10 0.92 %
> 2       GenuineIntel family 6 model 42 stepping 7 | 4   101213  7.32 %  #32 0.40 %
> 3       GenuineIntel family 6 model 58 stepping 9 | 4   98764   7.14 %  #42 0.30 %
> 4       GenuineIntel family 6 model 60 stepping 3 | 4   62630   4.53 %  (not in top 50)
> 5  *    GenuineIntel family 6 model 15 stepping 13 | 2  59451   4.30 %  #10 1.24 %
> 6       GenuineIntel family 6 model 42 stepping 7 | 2   47232   3.41 %  #24 0.47 %
> 7       GenuineIntel family 6 model 69 stepping 1 | 4   44410   3.21 %  (not in top 50)
> 8       GenuineIntel family 6 model 37 stepping 5 | 4   39198   2.83 %  #28 0.41 %
> 9       GenuineIntel family 6 model 58 stepping 9 | 2   34429   2.49 %  #24 0.46 %
> 10      ???                                             27544   1.99 %  ???
> 11      GenuineIntel family 6 model 61 stepping 4 | 4   25239   1.82 %  (not in top 50)
> 12      family 6 model 69 stepping 1 | 4                24855   1.80 %  #27 0.15 %
> 13      GenuineIntel family 6 model 60 stepping 3 | 8   23132   1.67 %  (not in top 50)
> 14      GenuineIntel family 6 model 23 stepping 6 | 2   20895   1.51 %  #15 0.75 %
> 15 ***  AuthenticAMD family 16 model 6 stepping 3 | 2   18341   1.33 %  #3  3.79 %
> 16      family 6 model 58 stepping 9 | 4                17604   1.27 %  #16 0.66 %
> 17      family 6 model 42 stepping 7 | 4                17084   1.23 %  #38 0.36 %
> 18      GenuineIntel family 6 model 37 stepping 2 | 4   16006   1.16 %  #28 0.46 %
> 19      GenuineIntel family 6 model 58 stepping 9 | 8   15807   1.14 %  #41 0.33 %
> 20      GenuineIntel family 6 model 60 stepping 3 | 2   14002   1.01 %  #30 0.36 %
> 21 ***  AuthenticAMD family 20 model 2 stepping 0 | 2   13663   0.99 %  #1  4.50 %
> 22 *    GenuineIntel family 6 model 15 stepping 11 | 2  13360   0.97 %  #9  1.16 %
> 23      family 6 model 23 stepping 10 | 2               13193   0.95 %  #17 0.75 %
> 24 *    AuthenticAMD family 16 model 6 stepping 2 | 2   12956   0.94 %  #4  1.52 %
> 25      GenuineIntel family 6 model 42 stepping 7 | 8   12083   0.87 %  #21 0.53 %
> 26      GenuineIntel family 6 model 15 stepping 2 | 2   10930   0.79 %  #13 0.91 %
> 27 *    GenuineIntel family 15 model 6 stepping 5 | 2   9988    0.72 %  #7  1.36 %
> 28      GenuineIntel family 6 model 15 stepping 6 | 2   9380    0.68 %  #13 0.81 %
> 29      GenuineIntel family 6 model 55 stepping 8 | 2   9241    0.67 %  (not in top 50)
> 30      GenuineIntel family 6 model 55 stepping 8 | 4   8767    0.63 %  (not in top 50)
> 31      family 6 model 58 stepping 9 | 8                8690    0.63 %  #28 0.41 %
> 32 *    GenuineIntel family 15 model 4 stepping 3 | 2   8179    0.59 %  #11 1.19 %
> 33 *    AuthenticAMD family 15 model 107 stepping 2 | 2 8130    0.59 %  #11 1.02 %
> 34 *    GenuineIntel family 6 model 22 stepping 1 | 1   7811    0.56 %  #9  1.13 %
> 35      GenuineIntel family 6 model 37 stepping 5 | 2   7714    0.56 %  (not in top 50)
> 36      GenuineIntel family 6 model 23 stepping 10 | 4  7568    0.55 %  #12 0.78 %
> 37 *    GenuineIntel family 15 model 4 stepping 1 | 2   6837    0.49 %  #8  1.23 %
> 38 *    GenuineIntel family 15 model 2 stepping 9 | 1   6692    0.48 %  #8  1.39 %
> 39 ***  AuthenticAMD family 22 model 0 stepping 1 | 2   6471    0.47 %  #5  4.42 %
> 40      family 6 model 70 stepping 1 | 8                6385    0.46 %  #39 0.33 %
> 41 *    GenuineIntel family 15 model 4 stepping 9 | 2   6176    0.45 %  #86 1.39 %
> 42      family 6 model 37 stepping 5 | 4                6064    0.44 %  #27 0.46 %
> 43 *    GenuineIntel family 15 model 4 stepping 1 | 1   5970    0.43 %  #7  1.57 %
> 44 ***  AuthenticAMD family 21 model 19 stepping 1 | 2  5894    0.43 %  #3  3.89 %
> 45      GenuineIntel family 6 model 28 stepping 10 | 2  5522    0.40 %  #23 0.42 %
> 46 *    AuthenticAMD family 18 model 1 stepping 0 | 2   5491    0.40 %  #5  1.60 %
> 47      GenuineIntel family 6 model 78 stepping 3 | 4   5489    0.40 %  (not in top 50)
> 48      family 6 model 42 stepping 7 | 8                5387    0.39 %  #27 0.48 %
> 49      GenuineIntel family 6 model 45 stepping 7 | 4   5368    0.39 %  (not in top 50)
> 50 *    AuthenticAMD family 21 model 16 stepping 1 | 2  5353    0.39 %  #9  1.20 %

Over all CPU families, EnterBaselineMethod crashes were 0.73% of all crashes.
Looking at individual CPU families, four of them stood out as having
EnterBaselineMethod crash rates in the range 3.79--4.50%. These are marked with
'***'. I also marked ones with an EnterBaselineMethod crash rate greater than
1% with '*', but those could just be natural variation.

I then searched for all EnterBaselineMethod crashes in Firefox in the past 7
days, faceted by Cpu Info:

https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Djs%3A%3Ajit%3A%3AEnterBaselineMethod&_facets=signature&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info

I then cross-correlated these ranks with the ranks in the table above, giving this table:

> Rank    Cpu info                                        Count   %           Rank in table 1
> 1       GenuineIntel family 6 model 23 stepping 10 | 2  1262    12.56 %     #1 
> 2       GenuineIntel family 6 model 15 stepping 13 | 2  738     7.34 %      #5
> 3       AuthenticAMD family 16 model 6 stepping 3 | 2   695     6.92 %      #15 ***
> 4       AuthenticAMD family 20 model 2 stepping 0 | 2   619     6.16 %      #21 ***
> 5       GenuineIntel family 6 model 42 stepping 7 | 4   404     4.02 %      #2
> 6       GenuineIntel family 6 model 58 stepping 9 | 4   292     2.91 %      #3
> 7       AuthenticAMD family 22 model 0 stepping 1 | 2   285     2.84 %      #39 ***
> 8       AuthenticAMD family 16 model 6 stepping 3 | 1   233     2.32 %      N/A ***
> 9       AuthenticAMD family 21 model 19 stepping 1 | 2  228     2.27 %      #44 ***
> 10      GenuineIntel family 6 model 42 stepping 7 | 2   220     2.19 %      #6
> 11      AuthenticAMD family 16 model 6 stepping 2 | 2   196     1.95 %      #24 *
> 12      GenuineIntel family 6 model 37 stepping 5 | 4   162     1.61 %      #8
> 13      GenuineIntel family 6 model 58 stepping 9 | 2   159     1.58 %      #9
> 14      GenuineIntel family 6 model 23 stepping 6 | 2   157     1.56 %      #14
> 15      GenuineIntel family 6 model 15 stepping 11 | 2  155     1.54 %      #22
> 16      AuthenticAMD family 20 model 1 stepping 0 | 2   143     1.42 %      N/A ***
> 17      GenuineIntel family 15 model 6 stepping 5 | 2   135     1.34 %      #27
> 18      GenuineIntel family 6 model 60 stepping 3 | 4   118     1.17 %      #4
> 19      AuthenticAMD family 6 model 8 stepping 1 | 1    103     1.02 %      N/A ***
> 20      GenuineIntel family 6 model 15 stepping 2 | 2   99      0.99 %      #26
> 21      GenuineIntel family 15 model 4 stepping 3 | 2   95      0.95 %      #32 *
> 22      GenuineIntel family 15 model 4 stepping 1 | 1   94      0.94 %      #43 *
> 23      GenuineIntel family 15 model 2 stepping 9 | 1   93      0.93 %      #38 *
> 24      AuthenticAMD family 18 model 1 stepping 0 | 2   90      0.90 %      #46 *
> 25      GenuineIntel family 6 model 22 stepping 1 | 1   89      0.89 %      #34 *
> 26      GenuineIntel family 15 model 4 stepping 9 | 2   85      0.85 %      #41 *
> 27      GenuineIntel family 15 model 4 stepping 1 | 2   84      0.84 %      #37 *
> 28      AuthenticAMD family 15 model 107 stepping 2 | 2 83      0.83 %      #33
> 29      AuthenticAMD family 22 model 0 stepping 1 | 4   78      0.78 %      N/A **
> 30      GenuineIntel family 6 model 15 stepping 6 | 2   76      0.76 %      #28
> 31      AuthenticAMD family 6 model 10 stepping 0 | 1   75      0.75 %      N/A **
> 32      GenuineIntel family 6 model 37 stepping 2 | 4   72      0.72 %      #18
> 33      GenuineIntel family 6 model 69 stepping 1 | 4   71      0.71 %      #7
> 34      GenuineIntel family 15 model 6 stepping 5 | 1   67      0.67 %      N/A *
> 35      AuthenticAMD family 21 model 16 stepping 1 | 2  64      0.64 %      #50 *
> 36      GenuineIntel family 6 model 42 stepping 7 | 8   64      0.64 %      #25
> 37      AuthenticAMD family 16 model 5 stepping 3 | 3   58      0.58 %      N/A *
> 38      AuthenticAMD family 21 model 16 stepping 1 | 4  58      0.58 %      N/A *
> 39      GenuineIntel family 6 model 23 stepping 10 | 4  57      0.57 %      #36
> 40      AuthenticAMD family 16 model 6 stepping 2 | 1   55      0.55 %      N/A *
> 41      AuthenticAMD family 21 model 48 stepping 1 | 4  55      0.55 %      N/A *
> 42      GenuineIntel family 15 model 4 stepping 9 | 1   55      0.55 %      N/A
> 43      GenuineIntel family 6 model 58 stepping 9 | 8   53      0.53 %      #19
> 44      GenuineIntel family 6 model 60 stepping 3 | 2   50      0.50 %      #20
> 45      AuthenticAMD family 18 model 1 stepping 0 | 4   49      0.49 %      N/A
> 46      GenuineIntel family 6 model 60 stepping 3 | 8   47      0.47 %      #13
> 47      AuthenticAMD family 16 model 4 stepping 3 | 4   46      0.46 %      N/A
> 48      AuthenticAMD family 15 model 75 stepping 2 | 2  43      0.43 %      N/A
> 49      AuthenticAMD family 16 model 5 stepping 3 | 4   43      0.43 %      N/A
> 50      GenuineIntel family 15 model 2 stepping 7 | 1   41      0.41 %      N/A

The relative position of a CPU family in the two tables indicates its crash rate.
For example, the first entry is unsurprising -- "GenuineIntel family 6 model 23
stepping 10" is the #1 family for EnterBaselineMethod crashes, but it's also the
#1 CPU overall.

But entries #3 and #4 in this table had much lower rankings in the first table,
which suggests they have unusually high EnterBaselineMethod crash rates, and
indeed they were two of the previously-identified suspicious ones.
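
As a rough cross-check, the per-family rate can be recomputed by dividing a family's count in this table by its count in the first table. The sketch below copies a few rows by hand; the dictionary layout and the name `ebm_rate` are just for illustration. Small differences from the manually extracted percentages are possible, since the two searches were separate queries.

```python
# cpu info -> (all crashes, EnterBaselineMethod crashes), from the two tables.
COUNTS = {
    "GenuineIntel family 6 model 23 stepping 10 | 2": (137092, 1262),
    "AuthenticAMD family 16 model 6 stepping 3 | 2": (18341, 695),
    "AuthenticAMD family 20 model 2 stepping 0 | 2": (13663, 619),
    "AuthenticAMD family 22 model 0 stepping 1 | 2": (6471, 285),
    "AuthenticAMD family 21 model 19 stepping 1 | 2": (5894, 228),
}

OVERALL_RATE = 0.73  # percent of all crashes, all families combined


def ebm_rate(total: int, ebm: int) -> float:
    """EnterBaselineMethod crashes as a percentage of a family's crashes."""
    return 100.0 * ebm / total


for cpu, (total, ebm) in COUNTS.items():
    rate = ebm_rate(total, ebm)
    print(f"{cpu:<48} {rate:5.2f} %  ({rate / OVERALL_RATE:.1f}x overall)")
```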

There are also some entries that show up reasonably high in this table, but
didn't show up at all in the previous table. So I looked them up and this gave
us a few more entries that could be added to the first table:

> ??      AuthenticAMD family 16 model 6 stepping 3 | 1   233     ?.?? %  #?  5.34 %
> ??      AuthenticAMD family 20 model 1 stepping 0 | 2   143     ?.?? %  #?  3.63 %
> ??      AuthenticAMD family 6 model 8 stepping 1 | 1    102     ?.?? %  #?  2.66 %
> ??      AuthenticAMD family 22 model 0 stepping 1 | 4   78      ?.?? %  #?  1.68 %

The first entry here, despite having a low number of crashes -- it must just be
an uncommon CPU family -- had an even higher EnterBaselineMethod crash rate of 5.34%.

This analysis isn't perfect because other crash signatures may also have
correlations against CPU family. Ideally we'd match the EnterBaselineMethod
crash rates for each CPU family against the CPU family usage among our user
population, perhaps from telemetry data.
Depends on: amdbug
Crash volume for signature 'js::jit::EnterBaselineMethod':
 - nightly (version 50): 3 crashes from 2016-06-06.
 - aurora  (version 49): 6 crashes from 2016-06-07.
 - esr     (version 45): 1324 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          2          0          0          0          0          1
 - aurora           3          1          0          0          1          1          0
 - esr            197        161        157        142        176        155         89

Affected platforms: Windows, Mac OS X, Linux
Hmm, if this is the AMD bug, that means an ESR version was impacted...
Based on comment 44, I would expect us to still have a non-zero baseline of crashes, especially on old hardware.  I guess the likelihood of using a release / ESR version might be higher on old hardware.

Bug 1281759 only landed in Gecko 50, so this should not have changed aurora.

Could this be a problem with the crash reporter when we have no stack frame at the top?  Or maybe we discard these reports?  Or are they classified under a bunch of different signatures?
See Also: → 1290419
Crash volume for signature 'js::jit::EnterBaselineMethod':
 - nightly (version 51): 1 crash from 2016-08-01.
 - aurora  (version 50): 1 crash from 2016-08-01.
 - beta    (version 49): 77 crashes from 2016-08-02.
 - release (version 48): 6134 crashes from 2016-07-25.
 - esr     (version 45): 1674 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        1       0       0
 - beta         28      22       9
 - release    1970    1820    1015
 - esr          68      47     121

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #730
 - aurora
 - beta    #589      #434
 - release #9        #4
 - esr     #156
See Also: → 1293996
Crash volume for signature 'js::jit::EnterBaselineMethod':
 - nightly (version 52): 2 crashes from 2016-09-19.
 - aurora  (version 51): 0 crashes from 2016-09-19.
 - beta    (version 50): 41 crashes from 2016-09-20.
 - release (version 49): 94 crashes from 2016-09-05.
 - esr     (version 45): 1728 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly       0       2
 - aurora        0       0
 - beta         34       7
 - release      67      27
 - esr         200     186

Affected platforms: Windows, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly
 - aurora
 - beta    #440      #631
 - release #975      #474
 - esr     #57
Priority: -- → P3
Mass wontfix for bugs affecting firefox 52.
Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] [@ js::jit::EnterBaselineMethod] → [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)] [@ js::jit::EnterBaselineMethod] [@ EnterJit]
Duplicate of this bug: 1408766
Adding this to our crash triage list.
Assignee: jdemooij → nobody
Whiteboard: [native-crash] → [native-crash][#jsapi:crashes-retriage]
Closing in favor of meta-bug Bug 858032. Current investigations branch off there.
Blocks: SadJit
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → INCOMPLETE
Whiteboard: [native-crash][#jsapi:crashes-retriage] → [native-crash]

Comment 73

a year ago
If this was due, at least partially, to bug 1281759 (I don't know whether it was; updated stats wouldn't hurt, I guess), then I'm not sure how sensible it is to refer to a meta bug.
The signature encompasses a number of causes. There have also been numerous renames of JIT signatures, which has added to the confusion. The meta-bug should refer to Bug 1281759 as one of the sources of crashes.