Work around mysterious AMD JIT crashes

RESOLVED FIXED in Firefox 49

Status

()

Core
JavaScript Engine: JIT
P3
normal
RESOLVED FIXED
a year ago
23 days ago

People

(Reporter: jandem, Assigned: jandem)

Tracking

(Depends on: 1 bug, Blocks: 3 bugs)

unspecified
mozilla50
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(platform-rel -, firefox49 fixed, firefox50 fixed, firefox51 fixed, firefox52 fixed)

Details

(Whiteboard: [platform-rel-AMD])

Attachments

(2 attachments)

(Assignee)

Description

a year ago
See the AMD crashes in bug 1034706 comment 44.

This still happens on 47. I want to add some nops to the Baseline stub where we're crashing (we can probably use cpuid to restrict it to these AMD CPUs) to see if that makes it go away. Info from AMD suggests it might help.
(Assignee)

Updated

a year ago
No longer blocks: 1279590
(Assignee)

Updated

a year ago
Blocks: 1279590
(Assignee)

Comment 1

a year ago
Created attachment 8764587 [details] [diff] [review]
Patch

This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and model is 0-2.

According to AMD engineers, limiting the number of branches per cache line might help, so I'm hopeful this will work. If this patch doesn't help, we can try more or larger NOPs...
Attachment #8764587 - Flags: review?(sunfish)
Comment on attachment 8764587 [details] [diff] [review]
Patch

Review of attachment 8764587 [details] [diff] [review]:
-----------------------------------------------------------------

Yikes. Patch looks good though.
Attachment #8764587 - Flags: review?(sunfish) → review+

Comment 3

a year ago
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae024662143f
Try to work around mysterious AMD crashes. r=sunfish

Comment 4

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ae024662143f
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox50: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
(Assignee)

Updated

a year ago
Flags: needinfo?(jdemooij)
> This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and
> model is 0-2.

Do we have confirmation from AMD that these models are the only ones affected? My analysis in bug 1034706 comment 63 suggests that these families account for only a fraction of suspicious JIT crashes on AMD CPUs.
Also, do you have an analysis plan to determine if this change helps in practice?
(Assignee)

Comment 7

a year ago
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Do we have confirmation from AMD that these models are the only ones
> affected? My analysis in bug 1034706 comment 63 suggests that these families
> account for only a fraction of suspicious JIT crashes on AMD CPUs.

No confirmation, but I haven't seen *these* crashes with other (AMD) CPUs. I can take another look at some of the other crashes though.

(In reply to Nicholas Nethercote [:njn] from comment #6)
> Also, do you have an analysis plan to determine if this change helps in
> practice?

I'll check if there's any difference on Nightly later today. Unfortunately not that many users are on Nightly and most probably don't use these CPUs, so we'll likely have to backport to see if it helps...
Blocks: 1289666
This might fix bug 1290388.
Blocks: 1290388
See Also: → bug 1290419
(Assignee)

Comment 9

a year ago
(In reply to Marco Castelluccio [:marco] (PTO until August 9) from comment #8)
> This might fix bug 1290388.

I don't think so; I moved it to the See Also list :)
No longer blocks: 1290388
Flags: needinfo?(jdemooij)
See Also: → bug 1290388
(Assignee)

Comment 10

a year ago
Comment on attachment 8764587 [details] [diff] [review]
Patch

This patch attempts to work around JIT code crashes on certain AMD CPUs. It inserts some NOP instructions on this particular CPU. It should be safe to backport.

Approval Request Comment
[Feature/regressing bug #]: -
[User impact if declined]: Potential crashes with certain AMD CPUs
[Describe test coverage new/current, TreeHerder]: On m-c and aurora for a while. I haven't seen these crashes on builds with this patch applied, but it's hard to say much because the crashes don't seem to affect all builds.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.
Attachment #8764587 - Flags: approval-mozilla-beta?
Comment on attachment 8764587 [details] [diff] [review]
Patch

Let's take it to try to fix the famous issue.
Should be in 49 beta 3
Attachment #8764587 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 12

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/7af8faeaf115
status-firefox49: --- → fixed
Looks like we have some new crashes on beta 3 described in bug 1294963. If you think it's from this, can you back out your changes or submit a fix? Thanks!
Flags: needinfo?(jdemooij)
On 2nd look, this does seem to have caused the #1 topcrash on beta 3.  Wes, can you back it out before the beta 5 build?
Depends on: 1294963
(Assignee)

Comment 15

a year ago
I'll take a closer look, but it's *extremely unlikely* this caused these crashes.

What makes us think this is related to bug 1294963? The AMD bug shows up in some builds but not others. Just because this patch fixes one instance of this bug by inserting NOP instructions, doesn't mean it caused a random bug in entirely unrelated code...
(Assignee)

Comment 16

a year ago
This workaround probably didn't work. I still see similar crashes like this one:

https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-e96e22160815

We did correctly identify the CPU and inserted NOP instructions, but apparently that was not enough to stop the crashes. I'll check the errata and conversations with AMD tomorrow to see if there's something I missed... I'll also check if there's any difference in frequency.

006a4530 83 f9 8c              cmp    ecx,0xffffff8c   <- crash here
006a4533 0f 85 16 00 00 00     jne    0x006a454f
006a4539 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a453d 39 57 10              cmp    DWORD PTR [edi+0x10],edx
006a4540 0f 85 09 00 00 00     jne    0x006a454f
006a4546 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a454a c3                    ret    
006a454b 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a454f 8b 7f 04              mov    edi,DWORD PTR [edi+0x4]
006a4552 ff 27                 jmp    DWORD PTR [edi]
Flags: needinfo?(jdemooij)
(Assignee)

Updated

a year ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P3
(In reply to Jan de Mooij [:jandem] from comment #16)
> This workaround probably didn't work. I still see similar crashes like this
> one:
> 
> https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-
> e96e22160815
> 
> We did correctly identify the CPU and inserted NOP instructions, but
> apparently that was not enough to stop the crashes. I'll check the errata
> and conversations with AMD tomorrow to see if there's something I missed...
> I'll also check if there's any difference in frequency.
> 
> 006a4530 83 f9 8c              cmp    ecx,0xffffff8c   <- crash here
> 006a4533 0f 85 16 00 00 00     jne    0x006a454f
> 006a4539 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a453d 39 57 10              cmp    DWORD PTR [edi+0x10],edx
> 006a4540 0f 85 09 00 00 00     jne    0x006a454f
> 006a4546 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a454a c3                    ret    
> 006a454b 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a454f 8b 7f 04              mov    edi,DWORD PTR [edi+0x4]
> 006a4552 ff 27                 jmp    DWORD PTR [edi]

Bug 772330 comment 55 mentions that the two "next instructions" could be affected. Should we try two nops instead of one long nop?
platform-rel: --- → ?
Whiteboard: [platform-rel-AMD]
platform-rel: ? → -
Created attachment 8805497 [details] [diff] [review]
Increasing NOP size

Talking to AMD, they suggested to maybe increase to 32 bytes of NOPs. Next try to squash those bugs.
Attachment #8805497 - Flags: review?(jdemooij)
(Assignee)

Comment 19

a year ago
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Review of attachment 8805497 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, let's see if this works :)
Attachment #8805497 - Flags: review?(jdemooij) → review+

Comment 20

a year ago
Pushed by hv1989@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d77c9965e9b2
Try to work around mysterious AMD crashes (take 2). r=jandem

Comment 21

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d77c9965e9b2
Status: REOPENED → RESOLVED
Last Resolved: a year agoa year ago
status-firefox52: --- → fixed
Resolution: --- → FIXED
See Also: → bug 1312270
See Also: → bug 1176268
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Approval Request Comment

[Feature/regressing bug #]:
/

[User impact if declined]:
This contains a possible fix for AMD Bobcat processors. We are not sure this is a definite fix, but we only see numbers high enough on beta to come to a conclusion if this fix works or doesn't work.

[Describe test coverage new/current, TreeHerder]:
Has been on nightly since last week. I also enabled this workaround on my own computer and this caused no problems.

[Risks and why]: 
We are improving a work-around for AMD bobcat processors only. We already had a work-around uplifted, but that didn't seem to fix the crashes. Hopefully this new patch does that. We increased the NOP size from 4-bytes to 32-bytes on specific positions. We don't expect any regressions or issues even if the work-around doesn't work.

[String/UUID change made/needed]:
/
Attachment #8805497 - Flags: approval-mozilla-beta?
Attachment #8805497 - Flags: approval-mozilla-aurora?
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Let's stabilize this on Aurora51 before deciding whether to uplift to Beta50.
Attachment #8805497 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

a year ago
status-firefox51: --- → affected

Comment 24

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/30be0417e259
status-firefox51: affected → fixed
Hello Jan, Hannes, I don't think this is a release blocker. We are in code freeze mode and only taking fixes for critical new regressions, crashes, security issues. I would like to wontfix this for 50 unless you can provide me with data that this is really valuable based on Nightly52/Aurora51 data. Thanks!
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)

Updated

a year ago
status-firefox50: fixed → wontfix

Updated

a year ago
Attachment #8805497 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Hello DBaron, this is a late uplift in 50 RC. Do you think this is something worth taking a risk on and uplifting to 50 RC2? Or are we better off letting this stabilize on Beta51 (after merge day) for a few weeks and uplifting to planned 50.1.0 release? Appreciate your help.
Flags: needinfo?(dbaron)
The most important thing is to have this on Beta as soon as possible. Since we can only decide if this workaround fixed the issue or not when it hits Beta. We don't have enough people with this combination of hardware to meaningful trigger this on aurora/nightly.

As soon as this is on Beta we have to wait 2-3 weeks before we have insightful information about it.

Now the most important part now is getting information to drive the next decision.
1) This fixes the crash. In that case we should try to get this in a stable X.1 release.
2) This doesn't fix the crash. We will go back to drawing board to find another workaround.

Currently it doesn't make sense for us to force an uplift for a 50.1.0 release without having crash information on Beta.
As a result we should OR takes this on Beta50 OR wait 2 weeks and automatically have this on Beta51.
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
As a side-note:

The workaround only triggers for "AMD Ontario" processors and is very benign. As a result even if the work-around doesn't work, we will not back this patch out. But let the removal of the work-around just ride the train.
I don't have a strong opinion here.  I'm generally pretty conservative on uplifts to RC, but if the patch is believed to be very safe I think it's worth considering, given that this was a significant number of crashes.
Flags: needinfo?(dbaron)
Depends on: 1315196
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

I have given it some thought. I will take a leap of faith and take this uplift in 50 RC2. If things look worse, we back it out and build RC3. If things look same/better, we ship with it. If things look worse on release, we do a dot release with a better fix or back out this one. Let's hope this one sticks. Beta50+
Attachment #8805497 - Flags: approval-mozilla-beta- → approval-mozilla-beta+

Updated

a year ago
status-firefox50: wontfix → affected
https://hg.mozilla.org/releases/mozilla-release/rev/ec795eb6be0099003f06c17c5afb6e08c84d392f
status-firefox50: affected → fixed

Updated

a year ago
See Also: → bug 1316022

Comment 32

a year ago
hi, rc2 still shows a particular crash pattern with some js related signatures (bug 1316022) that are correlated to the amd cpus: http://bit.ly/2fduzlh
Flags: needinfo?(jdemooij)
(In reply to [:philipp] from comment #32)
> hi, rc2 still shows a particular crash pattern with some js related
> signatures (bug 1316022) that are correlated to the amd cpus:
> http://bit.ly/2fduzlh

This bug is about the code generated by SpiderMonkey JIT compilers (as written in the bug title).

Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you got a report with no stack s trace then this might be related to this bug.
Flags: needinfo?(jdemooij)
(In reply to Nicolas B. Pierron [:nbp] from comment #33)
> (In reply to [:philipp] from comment #32)
> > hi, rc2 still shows a particular crash pattern with some js related
> > signatures (bug 1316022) that are correlated to the amd cpus:
> > http://bit.ly/2fduzlh
> 
> This bug is about the code generated by SpiderMonkey JIT compilers (as
> written in the bug title).
> 
> Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you
> got a report with no stack s trace then this might be related to this bug.

What do you mean by 'report with no stack trace'? Do you mean stack traces that contain code generated by the JIT?

If that's the case, we might have a few reports: https://crash-stats.mozilla.com/search/?build_id=20161104212021&signature=%3Djs%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3Djs%3A%3ANativeObject%3A%3AputProperty&signature=%3DAddOrChangeProperty&signature=%3DnsCOMPtr_base%3A%3A~nsCOMPtr_base%20%7C%20mozilla%3A%3Adom%3A%3AWriteOp%3A%3AInit&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AputProperty&signature=%3Djs%3A%3Ajit%3A%3ACallInfo%3A%3AsetImplicitlyUsedUnchecked&signature=~EnterIon&proto_signature=~EnterBaseline&proto_signature=~EnterBaselineMethod&product=Firefox&process_type=browser&date=%3E%3D2016-11-05T00%3A00%3A00.000Z&date=%3C2016-11-09T13%3A45%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_info&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

I've limited the search to stack traces that contain EnterBaseline, EnterBaselineMethod or EnterIon.

Could you check if one of those fits the conditions?
(In reply to Marco Castelluccio [:marco] from comment #34)
> (In reply to Nicolas B. Pierron [:nbp] from comment #33)
> > (In reply to [:philipp] from comment #32)
> > > hi, rc2 still shows a particular crash pattern with some js related
> > > signatures (bug 1316022) that are correlated to the amd cpus:
> > > http://bit.ly/2fduzlh
> > 
> > This bug is about the code generated by SpiderMonkey JIT compilers (as
> > written in the bug title).
> > 
> > Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you
> > got a report with no stack s trace then this might be related to this bug.
> 
> What do you mean by 'report with no stack trace'? Do you mean stack traces
> that contain code generated by the JIT?

Yes, this means that the stack frames do not have much symbols, and the next valid frame should be something such as EnterIon / EnterBaseline.

> If that's the case, we might have a few reports: […]
> 
> I've limited the search to stack traces that contain EnterBaseline,
> EnterBaselineMethod or EnterIon.
> 
> Could you check if one of those fits the conditions?

I will try to explain the reasoning I have on one example:

  @0x0                                    ,  This seems to appear when some valid stacks exists
  js::NativeObject::addPropertyInternal   |  below them.  Thus these might be called by the JIT but
  xpc::JSXrayTraits::getPrototype         '  not caused by it.
  js::Shape::search<T>                    
  js::AddTypePropertyId                   
  AddOrChangeProperty                     
  js::SetPropertyByDefining               
  mozglue.dll@0x44af                      ,
  @0xffffff8b                             |  These stack frames are in complete random-ish
  js::jit::SetPropertyIC::update          |  order, and this might not correspond
  @0x2e70d7                               |  to any valid call path which can be produced
  @0xffffff84                             |  by either our code-base, or by code generated
  intrinsic_OwnPropertyKeys               |  by the JIT compilers.
  @0x75d212f                              |
  @0x2516585f                             |  The most likely source of this non-sense are
  @0x168db13                              |  the remains of previous stack frames pushed
  @0x24b676ff                             |  on the stack.
  @0x7310942                              '
  EnterBaseline                           ,
  js::jit::EnterBaselineMethod            |
  Interpret                               |
  js::RunScript                           | Valid call stack for entering JIT code.
  js::InternalCallOrConstruct             |
  js::jit::InvokeFunction                 |
  js::jit::TryAttachCallStub              '

Thus, from my opinion, this precise stack frame does not seems to be a bug in our JIT compiler.  If we were looking for a bug in our JIT compiler I would expect the stack frames to look something like:

  @0x75d212f                              '
  js::jit::SetPropertyIC::update          |  Random mess from the stack, even if some of the mess
  @0x168db13                              |  can be attributed a valid name.
  @0x24b676ff                             |
  @0x7310942                              '
  EnterBaseline                           ,
  js::jit::EnterBaselineMethod            |
  Interpret                               |
  js::RunScript                           | Valid call stack for entering JIT code.
  js::InternalCallOrConstruct             |
  js::jit::InvokeFunction                 |
  js::jit::TryAttachCallStub              '
Some of the reports with signature "jit | CORRUPT_CODE" on 50.1.0 seem to be failing in JIT compiled code, e.g. bp-a884f584-e3df-4167-84e9-4adf82161220:
@0x71abc33 	
@0x75087de 	
@0x406ecdf7 	
@0x3ce20942 	
EnterBaseline
js::jit::EnterBaselineMethod(JSContext*, js::RunState&)
js::RunScript(JSContext*, js::RunState&)
js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)
InternalCall
@0xffffff8b

(We have a spike of crashes related to the usual AMD CPUs in 50.1.0, some of them aren't happening in JIT code, but some of them might be. The bug is bug 1287087).
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
I think this bug is responsible for these crashes in 57beta: https://crash-stats.mozilla.org/signature/?product=Firefox&version=57.0b3&address=%3D0xffffffffffffffff&signature=EnterBaseline&date=%3E%3D2017-09-20T22%3A19%3A05.000Z&date=%3C2017-09-27T22%3A19%3A05.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

I see nonsense memory violations at addresses without memory access. They happen in stubs with the work-around added by this patch, but happen at beginning of stub before the NOPs.
You need to log in before you can comment on or make changes to this bug.