Bug 1281759 (amdbug)

Work around mysterious AMD JIT crashes

REOPENED
Assigned to

Status

()

P3
normal
REOPENED
2 years ago
11 months ago

People

(Reporter: jandem, Assigned: tcampbell)

Tracking

(Depends on: 1 bug, Blocks: 3 bugs, {crash})

unspecified
mozilla50
crash
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(platform-rel -, firefox49 fixed, firefox50 fixed, firefox51 fixed, firefox52 fixed, firefox57 wontfix, firefox58 fix-optional, firefox59 affected)

Details

(Whiteboard: [platform-rel-AMD])

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
See the AMD crashes in bug 1034706 comment 44.

This still happens on 47. I want to add some nops to the Baseline stub where we're crashing (we can probably use cpuid to restrict it to these AMD CPUs) to see if that makes it go away. Info from AMD suggests it might help.
(Reporter)

Updated

2 years ago
No longer blocks: 1279590
(Reporter)

Updated

2 years ago
Blocks: 1279590
(Reporter)

Comment 1

2 years ago
Created attachment 8764587 [details] [diff] [review]
Patch

This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and model is 0-2.

According to AMD engineers, limiting the number of branches per cache line might help, so I'm hopeful this will work. If this patch doesn't help, we can try more or larger NOPs...
Attachment #8764587 - Flags: review?(sunfish)
Comment on attachment 8764587 [details] [diff] [review]
Patch

Review of attachment 8764587 [details] [diff] [review]:
-----------------------------------------------------------------

Yikes. Patch looks good though.
Attachment #8764587 - Flags: review?(sunfish) → review+

Comment 3

2 years ago
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae024662143f
Try to work around mysterious AMD crashes. r=sunfish

Comment 4

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ae024662143f
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox50: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
(Reporter)

Updated

2 years ago
Flags: needinfo?(jdemooij)
> This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and
> model is 0-2.

Do we have confirmation from AMD that these models are the only ones affected? My analysis in bug 1034706 comment 63 suggests that these families account for only a fraction of suspicious JIT crashes on AMD CPUs.
Also, do you have an analysis plan to determine if this change helps in practice?
(Reporter)

Comment 7

2 years ago
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Do we have confirmation from AMD that these models are the only ones
> affected? My analysis in bug 1034706 comment 63 suggests that these families
> account for only a fraction of suspicious JIT crashes on AMD CPUs.

No confirmation, but I haven't seen *these* crashes with other (AMD) CPUs. I can take another look at some of the other crashes though.

(In reply to Nicholas Nethercote [:njn] from comment #6)
> Also, do you have an analysis plan to determine if this change helps in
> practice?

I'll check if there's any difference on Nightly later today. Unfortunately not that many users are on Nightly and most probably don't use these CPUs, so we'll likely have to backport to see if it helps...
Blocks: 1289666
This might fix bug 1290388.
Blocks: 1290388
See Also: → bug 1290419
(Reporter)

Comment 9

2 years ago
(In reply to Marco Castelluccio [:marco] (PTO until August 9) from comment #8)
> This might fix bug 1290388.

I don't think so; I moved it to the See Also list :)
No longer blocks: 1290388
Flags: needinfo?(jdemooij)
See Also: → bug 1290388
(Reporter)

Comment 10

2 years ago
Comment on attachment 8764587 [details] [diff] [review]
Patch

This patch attempts to work around JIT code crashes on certain AMD CPUs. It inserts some NOP instructions on this particular CPU. It should be safe to backport.

Approval Request Comment
[Feature/regressing bug #]: -
[User impact if declined]: Potential crashes with certain AMD CPUs
[Describe test coverage new/current, TreeHerder]: On m-c and aurora for a while. I haven't seen these crashes on builds with this patch applied, but it's hard to say much because the crashes don't seem to affect all builds.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.
Attachment #8764587 - Flags: approval-mozilla-beta?
Comment on attachment 8764587 [details] [diff] [review]
Patch

Let's take it to try to fix the famous issue.
Should be in 49 beta 3
Attachment #8764587 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 12

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/7af8faeaf115
status-firefox49: --- → fixed
Looks like we have some new crashes on beta 3 described in bug 1294963. If you think it's from this, can you back out your changes or submit a fix? Thanks!
Flags: needinfo?(jdemooij)
On 2nd look, this does seem to have caused the #1 topcrash on beta 3.  Wes, can you back it out before the beta 5 build?
(Reporter)

Comment 15

2 years ago
I'll take a closer look, but it's *extremely unlikely* this caused these crashes.

What makes us think this is related to bug 1294963? The AMD bug shows up in some builds but not others. Just because this patch fixes one instance of this bug by inserting NOP instructions, doesn't mean it caused a random bug in entirely unrelated code...
(Reporter)

Comment 16

2 years ago
This workaround probably didn't work. I still see similar crashes like this one:

https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-e96e22160815

We did correctly identify the CPU and inserted NOP instructions, but apparently that was not enough to stop the crashes. I'll check the errata and conversations with AMD tomorrow to see if there's something I missed... I'll also check if there's any difference in frequency.

006a4530 83 f9 8c              cmp    ecx,0xffffff8c   <- crash here
006a4533 0f 85 16 00 00 00     jne    0x006a454f
006a4539 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a453d 39 57 10              cmp    DWORD PTR [edi+0x10],edx
006a4540 0f 85 09 00 00 00     jne    0x006a454f
006a4546 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a454a c3                    ret    
006a454b 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a454f 8b 7f 04              mov    edi,DWORD PTR [edi+0x4]
006a4552 ff 27                 jmp    DWORD PTR [edi]
Flags: needinfo?(jdemooij)
(Reporter)

Updated

2 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P3
(In reply to Jan de Mooij [:jandem] from comment #16)
> This workaround probably didn't work. I still see similar crashes like this
> one:
> 
> https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-
> e96e22160815
> 
> We did correctly identify the CPU and inserted NOP instructions, but
> apparently that was not enough to stop the crashes. I'll check the errata
> and conversations with AMD tomorrow to see if there's something I missed...
> I'll also check if there's any difference in frequency.
> 
> 006a4530 83 f9 8c              cmp    ecx,0xffffff8c   <- crash here
> 006a4533 0f 85 16 00 00 00     jne    0x006a454f
> 006a4539 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a453d 39 57 10              cmp    DWORD PTR [edi+0x10],edx
> 006a4540 0f 85 09 00 00 00     jne    0x006a454f
> 006a4546 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a454a c3                    ret    
> 006a454b 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a454f 8b 7f 04              mov    edi,DWORD PTR [edi+0x4]
> 006a4552 ff 27                 jmp    DWORD PTR [edi]

Bug 772330 comment 55 mentions that the two "next instructions" could be affected. Should we try two nops instead of one long nop?
platform-rel: --- → ?
Whiteboard: [platform-rel-AMD]

Updated

2 years ago
platform-rel: ? → -
Created attachment 8805497 [details] [diff] [review]
Increasing NOP size

Talking to AMD, they suggested to maybe increase to 32 bytes of NOPs. Next try to squash those bugs.
Attachment #8805497 - Flags: review?(jdemooij)
(Reporter)

Comment 19

2 years ago
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Review of attachment 8805497 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, let's see if this works :)
Attachment #8805497 - Flags: review?(jdemooij) → review+

Comment 20

2 years ago
Pushed by hv1989@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d77c9965e9b2
Try to work around mysterious AMD crashes (take 2). r=jandem

Comment 21

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d77c9965e9b2
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago2 years ago
status-firefox52: --- → fixed
Resolution: --- → FIXED
See Also: → bug 1312270
See Also: → bug 1176268
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Approval Request Comment

[Feature/regressing bug #]:
/

[User impact if declined]:
This contains a possible fix for AMD Bobcat processors. We are not sure this is a definite fix, but we only see numbers high enough on beta to come to a conclusion if this fix works or doesn't work.

[Describe test coverage new/current, TreeHerder]:
Has been on nightly since last week. I also enabled this workaround on my own computer and this caused no problems.

[Risks and why]: 
We are improving a work-around for AMD bobcat processors only. We already had a work-around uplifted, but that didn't seem to fix the crashes. Hopefully this new patch does that. We increased the NOP size from 4-bytes to 32-bytes on specific positions. We don't expect any regressions or issues even if the work-around doesn't work.

[String/UUID change made/needed]:
/
Attachment #8805497 - Flags: approval-mozilla-beta?
Attachment #8805497 - Flags: approval-mozilla-aurora?
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Let's stabilize this on Aurora51 before deciding whether to uplift to Beta50.
Attachment #8805497 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

2 years ago
status-firefox51: --- → affected

Comment 24

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/30be0417e259
status-firefox51: affected → fixed
Hello Jan, Hannes, I don't think this is a release blocker. We are in code freeze mode and only taking fixes for critical new regressions, crashes, security issues. I would like to wontfix this for 50 unless you can provide me with data that this is really valuable based on Nightly52/Aurora51 data. Thanks!
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)

Updated

2 years ago
status-firefox50: fixed → wontfix

Updated

2 years ago
Attachment #8805497 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Hello DBaron, this is a late uplift in 50 RC. Do you think this is something worth taking a risk on and uplifting to 50 RC2? Or are we better off letting this stabilize on Beta51 (after merge day) for a few weeks and uplifting to planned 50.1.0 release? Appreciate your help.
Flags: needinfo?(dbaron)
The most important thing is to have this on Beta as soon as possible. Since we can only decide if this workaround fixed the issue or not when it hits Beta. We don't have enough people with this combination of hardware to meaningful trigger this on aurora/nightly.

As soon as this is on Beta we have to wait 2-3 weeks before we have insightful information about it.

Now the most important part now is getting information to drive the next decision.
1) This fixes the crash. In that case we should try to get this in a stable X.1 release.
2) This doesn't fix the crash. We will go back to drawing board to find another workaround.

Currently it doesn't make sense for us to force an uplift for a 50.1.0 release without having crash information on Beta.
As a result we should OR takes this on Beta50 OR wait 2 weeks and automatically have this on Beta51.
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
As a side-note:

The workaround only triggers for "AMD Ontario" processors and is very benign. As a result even if the work-around doesn't work, we will not back this patch out. But let the removal of the work-around just ride the train.
I don't have a strong opinion here.  I'm generally pretty conservative on uplifts to RC, but if the patch is believed to be very safe I think it's worth considering, given that this was a significant number of crashes.
Flags: needinfo?(dbaron)
Depends on: 1315196
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

I have given it some thought. I will take a leap of faith and take this uplift in 50 RC2. If things look worse, we back it out and build RC3. If things look same/better, we ship with it. If things look worse on release, we do a dot release with a better fix or back out this one. Let's hope this one sticks. Beta50+
Attachment #8805497 - Flags: approval-mozilla-beta- → approval-mozilla-beta+

Updated

2 years ago
status-firefox50: wontfix → affected

Updated

2 years ago
See Also: → bug 1316022

Comment 32

2 years ago
hi, rc2 still shows a particular crash pattern with some js related signatures (bug 1316022) that are correlated to the amd cpus: http://bit.ly/2fduzlh
Flags: needinfo?(jdemooij)
(In reply to [:philipp] from comment #32)
> hi, rc2 still shows a particular crash pattern with some js related
> signatures (bug 1316022) that are correlated to the amd cpus:
> http://bit.ly/2fduzlh

This bug is about the code generated by SpiderMonkey JIT compilers (as written in the bug title).

Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you got a report with no stack s trace then this might be related to this bug.
Flags: needinfo?(jdemooij)
(In reply to Nicolas B. Pierron [:nbp] from comment #33)
> (In reply to [:philipp] from comment #32)
> > hi, rc2 still shows a particular crash pattern with some js related
> > signatures (bug 1316022) that are correlated to the amd cpus:
> > http://bit.ly/2fduzlh
> 
> This bug is about the code generated by SpiderMonkey JIT compilers (as
> written in the bug title).
> 
> Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you
> got a report with no stack s trace then this might be related to this bug.

What do you mean by 'report with no stack trace'? Do you mean stack traces that contain code generated by the JIT?

If that's the case, we might have a few reports: https://crash-stats.mozilla.com/search/?build_id=20161104212021&signature=%3Djs%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3Djs%3A%3ANativeObject%3A%3AputProperty&signature=%3DAddOrChangeProperty&signature=%3DnsCOMPtr_base%3A%3A~nsCOMPtr_base%20%7C%20mozilla%3A%3Adom%3A%3AWriteOp%3A%3AInit&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AputProperty&signature=%3Djs%3A%3Ajit%3A%3ACallInfo%3A%3AsetImplicitlyUsedUnchecked&signature=~EnterIon&proto_signature=~EnterBaseline&proto_signature=~EnterBaselineMethod&product=Firefox&process_type=browser&date=%3E%3D2016-11-05T00%3A00%3A00.000Z&date=%3C2016-11-09T13%3A45%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_info&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

I've limited the search to stack traces that contain EnterBaseline, EnterBaselineMethod or EnterIon.

Could you check if one of those fits the conditions?
(In reply to Marco Castelluccio [:marco] from comment #34)
> (In reply to Nicolas B. Pierron [:nbp] from comment #33)
> > (In reply to [:philipp] from comment #32)
> > > hi, rc2 still shows a particular crash pattern with some js related
> > > signatures (bug 1316022) that are correlated to the amd cpus:
> > > http://bit.ly/2fduzlh
> > 
> > This bug is about the code generated by SpiderMonkey JIT compilers (as
> > written in the bug title).
> > 
> > Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you
> > got a report with no stack s trace then this might be related to this bug.
> 
> What do you mean by 'report with no stack trace'? Do you mean stack traces
> that contain code generated by the JIT?

Yes, this means that the stack frames do not have much symbols, and the next valid frame should be something such as EnterIon / EnterBaseline.

> If that's the case, we might have a few reports: […]
> 
> I've limited the search to stack traces that contain EnterBaseline,
> EnterBaselineMethod or EnterIon.
> 
> Could you check if one of those fits the conditions?

I will try to explain the reasoning I have on one example:

  @0x0                                    ,  This seems to appear when some valid stacks exists
  js::NativeObject::addPropertyInternal   |  below them.  Thus these might be called by the JIT but
  xpc::JSXrayTraits::getPrototype         '  not caused by it.
  js::Shape::search<T>                    
  js::AddTypePropertyId                   
  AddOrChangeProperty                     
  js::SetPropertyByDefining               
  mozglue.dll@0x44af                      ,
  @0xffffff8b                             |  These stack frames are in complete random-ish
  js::jit::SetPropertyIC::update          |  order, and this might not correspond
  @0x2e70d7                               |  to any valid call path which can be produced
  @0xffffff84                             |  by either our code-base, or by code generated
  intrinsic_OwnPropertyKeys               |  by the JIT compilers.
  @0x75d212f                              |
  @0x2516585f                             |  The most likely source of this non-sense are
  @0x168db13                              |  the remains of previous stack frames pushed
  @0x24b676ff                             |  on the stack.
  @0x7310942                              '
  EnterBaseline                           ,
  js::jit::EnterBaselineMethod            |
  Interpret                               |
  js::RunScript                           | Valid call stack for entering JIT code.
  js::InternalCallOrConstruct             |
  js::jit::InvokeFunction                 |
  js::jit::TryAttachCallStub              '

Thus, from my opinion, this precise stack frame does not seems to be a bug in our JIT compiler.  If we were looking for a bug in our JIT compiler I would expect the stack frames to look something like:

  @0x75d212f                              '
  js::jit::SetPropertyIC::update          |  Random mess from the stack, even if some of the mess
  @0x168db13                              |  can be attributed a valid name.
  @0x24b676ff                             |
  @0x7310942                              '
  EnterBaseline                           ,
  js::jit::EnterBaselineMethod            |
  Interpret                               |
  js::RunScript                           | Valid call stack for entering JIT code.
  js::InternalCallOrConstruct             |
  js::jit::InvokeFunction                 |
  js::jit::TryAttachCallStub              '
Some of the reports with signature "jit | CORRUPT_CODE" on 50.1.0 seem to be failing in JIT compiled code, e.g. bp-a884f584-e3df-4167-84e9-4adf82161220:
@0x71abc33 	
@0x75087de 	
@0x406ecdf7 	
@0x3ce20942 	
EnterBaseline
js::jit::EnterBaselineMethod(JSContext*, js::RunState&)
js::RunScript(JSContext*, js::RunState&)
js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)
InternalCall
@0xffffff8b

(We have a spike of crashes related to the usual AMD CPUs in 50.1.0, some of them aren't happening in JIT code, but some of them might be. The bug is bug 1287087).
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
(Assignee)

Comment 37

a year ago
I think this bug is responsible for these crashes in 57beta: https://crash-stats.mozilla.org/signature/?product=Firefox&version=57.0b3&address=%3D0xffffffffffffffff&signature=EnterBaseline&date=%3E%3D2017-09-20T22%3A19%3A05.000Z&date=%3C2017-09-27T22%3A19%3A05.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports

I see nonsense memory violations at addresses without memory access. They happen in stubs with the work-around added by this patch, but happen at beginning of stub before the NOPs.
(Assignee)

Updated

a year ago
Alias: amdbug
(Assignee)

Updated

a year ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 1274659
status-firefox57: --- → affected
status-firefox58: --- → affected
status-firefox59: --- → affected
Keywords: crash

Comment 39

a year ago
Well, well, well. Look what we have here. 
AMD's family 20, 14h, meaning Bobcat cpus. 

50€ this is about what I extensively reported in bug 1335925, comment 1. 
You are going to have a pretty bad time trying to 1. workaround the issue and 2. detect when to apply workaround in the first place. 
Link again to the (likely) problem: http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=63

For starters, I'm not sure how you'd manage to avoid a vague "set of internal timing conditions", if not by blindly pushing patches over and over again, waiting for wheel of luck to spin. 
Secondly, don't expect microcode to mean anything at all. 
This is about something else getting updated in the bios. AGESA maybe. Anyway, something that to the best of my knowledge is hard to figure out even after disassembling bios. 

Proper way to do control, would be to check MSRs, but that's really not doable (bug 772330, comment 25). 
A manual blacklist of 'crashy' unpatched bios revisions, if any, is the only thing that I see could *remotely* do something. 
(even though that'd still be suboptimal for cases where you have a CPU with bug fixed in silicon, but also bug-unpatched bios)

Also worth mentioning that linux got a fix backed in the OS. 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfc1168de949cd3e9ca18c3480b5085deff1ea7c
If you use anything newer than 4.14-rc7, you should be 100% safe. 

On windows it seems like AMD itself used to take care of implementing similar solutions (AMDeFix in their chipset drivers), which I think could even be pushed by microsoft universally.. But for the moment, no luck at all. 

Regards.

Updated

a year ago
Blocks: 1335925
(Assignee)

Comment 40

a year ago
Thanks for the info! I've got my hands on some machines with the chip now. I wonder if we should consider disabling JITs on this processor so at least the browser doesn't crash all day =\

Comment 41

a year ago
Afaik (bug 1298333) :marco should also have one. 
Or better, the apu family should be the right one. But god knows if it's one of those fixed batches or not. 

It shouldn't be all that hard to manually check MSRs on linux though. 
Or run my test case in bug 1335925 under windows.
(In reply to Ted Campbell [:tcampbell] from comment #40)
> Thanks for the info! I've got my hands on some machines with the chip now. I
> wonder if we should consider disabling JITs on this processor so at least
> the browser doesn't crash all day =\

Ted: How common are these buggy AMD processors? Do these crashes only affect 64-bit Firefox? IIUC, EnterBaseline is a pretty generic crash signature so I don't know how to filter crash reports for this specific AMD issue.

jandem: Should we disable JITs for these AMD users? How much will that hurt performance?
Flags: needinfo?(jdemooij)

Comment 43

a year ago
Afaik they are so common that you ended up blacklisting the *whole* AMD drivers, just in an attempt to workaround this bug. 
Almost a ten of times during the years. Sometimes even nailing down the issue (can't give enough credits to dmajor), but basically always forgetting or passing over further specific actions. 

If you check bug 1335925, you can see my test case was actually done with FF19, so it's by no means 64-bit only. 
Just, as I was saying, the conditions to trigger the bug are pretty much a lottery. 

Isn't there some amd engineer around the bugzilla that may have further background?
(Assignee)

Comment 44

a year ago
I'm not familiar with graphics drivers or the blacklist process, so I'll leave that discussion to the other bug.

Scope of this bug is reducing the EnterBaseline crash signature. We are currently seeing this on Firefox 57 64-bit quite a bit on these CPUs. We've applied mitigations in the past on the advice of AMD engineers, but it seems the combination of codegen tweaks and our 64-bit build in 57 has made this appear in larger than previously see.
(Assignee)

Comment 45

a year ago
Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on Firefox 57 before it crashed =\
(Reporter)

Comment 46

a year ago
Forwarding to Ted since he's working on this.
Flags: needinfo?(jdemooij) → needinfo?(tcampbell)

Comment 47

a year ago
(In reply to Ted Campbell [:tcampbell] from comment #45)
> Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on
> Firefox 57 before it crashed =\

Could not reproduce on W7 on my K53U with unpatched bios.
(Assignee)

Comment 48

a year ago
It seems it was just luck it crashed so fast. The two machines were able to stay up for the hour I observed them. Will enable full crash dumps today and somehow script them to stop / stop firefox and load a page. I'd like to get a dump of what path is entering the TypeMonitor stub and determine if adding the NOPs to that end improves stability enough.

Also, as suggested by :jandem on IRC, we shouldn't have to go so far as disabling full JIT, but rather disable key IC stubs on affected machines.
Flags: needinfo?(tcampbell)
(Assignee)

Comment 49

a year ago
Looking at crash stats, these crashes are happening on up-to-date Windows 10 / 8.1 systems with catalyst video driver from windows update and cpu microcode updates applied.

Comment 50

a year ago
No Windows 7?

Also, putting aside that at least the provided W10 driver was quite of a letdown last year, that makes sense. 
If my digging through coreboot is right, the fix for erratum 688 pass through an AGESA update, which in turn requires a totally new bios release. 
Or at least I see this correlation in the newer fixed one for my laptop. 

I wonder if said AGESA release number couldn't be read from userspace?
(Assignee)

Comment 51

a year ago
There are Windows 7 crashes too.

I'm having trouble reproducing still. Based on mirh's observations that old microcode was really unstable, I've disabled Windows microcode updates on one of these and am still not hitting it. The AGESA update thing is interesting. I'll see if I can find any info on what these laptops have.

Current plan is still to try and capture a full process crash dump and investigate the JIT code paths preceding the crash. Mitigation will likely be adding more of the no-ops or disabling the unlucky jit stubs in order to reduce rate of crashes to something more acceptable.

Comment 52

a year ago
(In reply to Ted Campbell [:tcampbell] from comment #51)
> There are Windows 7 crashes too.

Oh, ok, I was almost panicking to think the contrary. 

Said this, again, for the fourth time: microcode means *nothing*. 
0x05000029 (the most updated one) yes, does something, but it really fixes all but the errata we are interested in. 
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/microcode_amd.bin.README#n26

AGESA version can usually be easily checked by opening bios image in hex editor and looking for "OntaroPI" string. 
Or I think even AIDA64 should be able to detect it, though the magic involved misses me. 

At least in my laptop, broken bios have v1.1.0.0, while fixed ones are 1.2.0.0, so that was the educate guess I made. 
The last word to know whether you are affected by bug though (according to docs) would be running under linux

sudo setpci -s 18.4 0x164.l

sudo modprobe msr
sudo rdmsr --all 0xc0011021

And check if the lasts of these commands returns respectively 00000003 and 10008000. 

Sad reproducing it on 57 is so difficult.
(Assignee)

Comment 53

a year ago
(In reply to mirh from comment #52)
> Said this, again, for the fourth time: microcode means *nothing*. 
> 0x05000029 (the most updated one) yes, does something, but it really fixes
> all but the errata we are interested in. 
> https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/
> microcode_amd.bin.README#n26

Ah, nice link. I now follow what you were saying.


> AGESA version can usually be easily checked by opening bios image in hex
> editor and looking for "OntaroPI" string. 
> Or I think even AIDA64 should be able to detect it, though the magic
> involved misses me. 

AGESA 1.1.0.0 is what both factory and "updated" firmware images show for these X130e machines.


> sudo setpci -s 18.4 0x164.l
> 
> sudo modprobe msr
> sudo rdmsr --all 0xc0011021
> 
> And check if the lasts of these commands returns respectively 00000003 and
> 10008000.

I get 00000007 and 10008000 on this one laptop, indicating it should already be fixed. I'll check the other laptop on Monday. (I'm don't have my notes here on which one triggered the bug).

Comment 54

a year ago
(In reply to Ted Campbell [:tcampbell] from comment #53)
> I get 00000007 and 10008000 on this one laptop, indicating it should already
> be fixed. I'll check the other laptop on Monday. (I'm don't have my notes
> here on which one triggered the bug).

Loool. No **** then you wasn't experiencing the issue!!
I must correct though. 
In addition to 00000003, even 00000001 signals a bug. 
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=9
[Tracking Requested - why for this release]:

This 64-bit EnterBaseline JIT crash is 57.0 content process top crash #2.
tracking-firefox57: --- → ?
Ted, Release Management plans to ship a 57.0.1 dot release this week. Do you think we should temporarily disable the JIT for users on these AMD CPUs in 57.0.1? You mentioned on Slack that some people are hitting this crash 10+ times a day.
Flags: needinfo?(tcampbell)
Ted says we shouldn't rush a fix for 57.0.1. This is sensitive code and, while we are receiving many crash reports, the actual number of affected users is relatively small. On Beta 58, for example, we have about 5000 reports from just 150 installs.
tracking-firefox57: ? → ---
Flags: needinfo?(tcampbell)
(Assignee)

Comment 58

a year ago
It is looking like only a fraction users of this processor are hitting the problem badly. We don't have a reliable test for identifying which are affected, so it is unwise to risk hurting perf of anyone using this processor. People who are affected should use 32-bit builds.

The second sample laptop I have does report (via privileged mode tests that mirh described above) as having bad silicon. I'm continuing to try and reproduce the issue more than once.

We should continue to track this issue, but fixes should be attempted in beta first.
[Tracking Requested - why for this release]:
tracking-firefox58: --- → ?
(Assignee)

Comment 60

a year ago
Actually, rates in beta are much lower than I thought. 4000/4700 crashes in last week are a single user hitting a crash report issue that caused it to submit a single crash over and over again. Once correcting for that, the AMD bug is still the cause of 100 crashes a day on beta 58. These AMD crashes are still almost all 64-bit. We should still monitor this bug, but it isn't really #2 crash on beta.
(Assignee)

Updated

a year ago
Depends on: 1421445
Not sure we should rush anything across trains here.
status-firefox57: affected → wontfix
status-firefox58: affected → fix-optional
tracking-firefox58: ? → ---
(Reporter)

Updated

11 months ago
Assignee: jdemooij → tcampbell
You need to log in before you can comment on or make changes to this bug.