Closed Bug 1281759 (amdbug) Opened 8 years ago Closed 3 years ago

Work around mysterious AMD JIT crashes

Categories

(Core :: JavaScript Engine: JIT, defect, P3)


Tracking


RESOLVED FIXED
mozilla50
Tracking Status
platform-rel --- -
firefox49 --- fixed
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix

People

(Reporter: jandem, Assigned: jandem)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Keywords: crash, Whiteboard: [platform-rel-AMD])

Attachments

(2 files)

See the AMD crashes in bug 1034706 comment 44.

This still happens on 47. I want to add some nops to the Baseline stub where we're crashing (we can probably use cpuid to restrict it to these AMD CPUs) to see if that makes it go away. Info from AMD suggests it might help.
No longer blocks: 1279590
Blocks: 1279590
Attached patch Patch — Splinter Review
This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and model is 0-2.

According to AMD engineers, limiting the number of branches per cache line might help, so I'm hopeful this will work. If this patch doesn't help, we can try more or larger NOPs...
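For context, a minimal sketch of how such a CPUID check can work (assuming GCC/Clang's <cpuid.h>; this is not the actual patch code). The family/model decoding follows the usual AMD convention: when the base family is 0xF, the extended family is added to it, and the extended model is prepended to the base model.

  // Sketch only: detect "AMD family 20 (0x14), model 0-2" (Bobcat).
  #include <cpuid.h>   // GCC/Clang intrinsic header
  #include <cstdint>

  static bool IsAffectedAmdCpu() {
    unsigned eax, ebx, ecx, edx;
    // Leaf 0: vendor string "AuthenticAMD" is split across ebx/edx/ecx.
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
      return false;
    if (ebx != 0x68747541 || edx != 0x69746e65 || ecx != 0x444d4163)
      return false;
    // Leaf 1: family/model/stepping packed into eax.
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
      return false;
    uint32_t baseFamily = (eax >> 8) & 0xf;
    uint32_t family =
        baseFamily == 0xf ? baseFamily + ((eax >> 20) & 0xff) : baseFamily;
    uint32_t model = ((eax >> 12) & 0xf0) | ((eax >> 4) & 0xf);
    return family == 0x14 && model <= 2;  // 0x14 == 20 decimal
  }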
Attachment #8764587 - Flags: review?(sunfish)
Comment on attachment 8764587 [details] [diff] [review]
Patch

Review of attachment 8764587 [details] [diff] [review]:
-----------------------------------------------------------------

Yikes. Patch looks good though.
Attachment #8764587 - Flags: review?(sunfish) → review+
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae024662143f
Try to work around mysterious AMD crashes. r=sunfish
https://hg.mozilla.org/mozilla-central/rev/ae024662143f
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Flags: needinfo?(jdemooij)
> This adds some 4-byte NOPs to this IC stub on x86 if CPU family is 20 and
> model is 0-2.

Do we have confirmation from AMD that these models are the only ones affected? My analysis in bug 1034706 comment 63 suggests that these families account for only a fraction of suspicious JIT crashes on AMD CPUs.
Also, do you have an analysis plan to determine if this change helps in practice?
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Do we have confirmation from AMD that these models are the only ones
> affected? My analysis in bug 1034706 comment 63 suggests that these families
> account for only a fraction of suspicious JIT crashes on AMD CPUs.

No confirmation, but I haven't seen *these* crashes with other (AMD) CPUs. I can take another look at some of the other crashes though.

(In reply to Nicholas Nethercote [:njn] from comment #6)
> Also, do you have an analysis plan to determine if this change helps in
> practice?

I'll check if there's any difference on Nightly later today. Unfortunately not that many users are on Nightly and most probably don't use these CPUs, so we'll likely have to backport to see if it helps...
This might fix bug 1290388.
Blocks: 1290388
See Also: → 1290419
(In reply to Marco Castelluccio [:marco] (PTO until August 9) from comment #8)
> This might fix bug 1290388.

I don't think so; I moved it to the See Also list :)
No longer blocks: 1290388
Flags: needinfo?(jdemooij)
See Also: → 1290388
Comment on attachment 8764587 [details] [diff] [review]
Patch

This patch attempts to work around JIT code crashes on certain AMD CPUs. It inserts some NOP instructions on these CPUs. It should be safe to backport.

Approval Request Comment
[Feature/regressing bug #]: -
[User impact if declined]: Potential crashes with certain AMD CPUs
[Describe test coverage new/current, TreeHerder]: On m-c and aurora for a while. I haven't seen these crashes on builds with this patch applied, but it's hard to say much because the crashes don't seem to affect all builds.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.
Attachment #8764587 - Flags: approval-mozilla-beta?
Comment on attachment 8764587 [details] [diff] [review]
Patch

Let's take it to try to fix the famous issue.
Should be in 49 beta 3
Attachment #8764587 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Looks like we have some new crashes on beta 3 described in bug 1294963. If you think it's from this, can you back out your changes or submit a fix? Thanks!
Flags: needinfo?(jdemooij)
On 2nd look, this does seem to have caused the #1 topcrash on beta 3.  Wes, can you back it out before the beta 5 build?
I'll take a closer look, but it's *extremely unlikely* this caused these crashes.

What makes us think this is related to bug 1294963? The AMD bug shows up in some builds but not others. Just because this patch fixes one instance of this bug by inserting NOP instructions doesn't mean it caused a random bug in entirely unrelated code...
This workaround probably didn't work. I still see similar crashes like this one:

https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-e96e22160815

We did correctly identify the CPU and insert NOP instructions, but apparently that was not enough to stop the crashes. I'll check the errata and conversations with AMD tomorrow to see if there's something I missed... I'll also check if there's any difference in frequency.

006a4530 83 f9 8c              cmp    ecx,0xffffff8c   <- crash here
006a4533 0f 85 16 00 00 00     jne    0x006a454f
006a4539 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a453d 39 57 10              cmp    DWORD PTR [edi+0x10],edx
006a4540 0f 85 09 00 00 00     jne    0x006a454f
006a4546 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a454a c3                    ret    
006a454b 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
006a454f 8b 7f 04              mov    edi,DWORD PTR [edi+0x4]
006a4552 ff 27                 jmp    DWORD PTR [edi]
Flags: needinfo?(jdemooij)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Priority: -- → P3
(In reply to Jan de Mooij [:jandem] from comment #16)
> This workaround probably didn't work. I still see similar crashes like this
> one:
> 
> https://crash-stats.mozilla.com/report/index/6a120ee3-cf18-48cc-8dc7-
> e96e22160815
> 
> We did correctly identify the CPU and insert NOP instructions, but
> apparently that was not enough to stop the crashes. I'll check the errata
> and conversations with AMD tomorrow to see if there's something I missed...
> I'll also check if there's any difference in frequency.
> 
> 006a4530 83 f9 8c              cmp    ecx,0xffffff8c   <- crash here
> 006a4533 0f 85 16 00 00 00     jne    0x006a454f
> 006a4539 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a453d 39 57 10              cmp    DWORD PTR [edi+0x10],edx
> 006a4540 0f 85 09 00 00 00     jne    0x006a454f
> 006a4546 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a454a c3                    ret    
> 006a454b 0f 1f 40 00           nop    DWORD PTR [eax+0x0]
> 006a454f 8b 7f 04              mov    edi,DWORD PTR [edi+0x4]
> 006a4552 ff 27                 jmp    DWORD PTR [edi]

Bug 772330 comment 55 mentions that the two "next instructions" could be affected. Should we try two nops instead of one long nop?
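For reference, the two byte patterns being weighed here (standard x86 NOP encodings; whether the hardware reacts differently to them is exactly the open question):

  66 90                 two-byte NOP (operand-size prefix + 0x90); two of these cover 4 bytes
  0f 1f 40 00           four-byte NOP, the form the current patch emits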
platform-rel: --- → ?
Whiteboard: [platform-rel-AMD]
platform-rel: ? → -
Attached patch Increasing NOP size — Splinter Review
Talking to AMD, they suggested to maybe increase to 32 bytes of NOPs. Next try to squash those bugs.
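To illustrate what "32 bytes of NOPs" means at the byte level, here is a hypothetical padding emitter (it writes into a raw code buffer; the real patch goes through the JIT's assembler). It reuses the 4-byte NOP encoding, 0F 1F 40 00, visible in the disassembly in comment 16:

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // Fill the buffer at 'code' with 'bytes' bytes of NOPs,
  // preferring the 4-byte form.
  static void EmitNopPadding(uint8_t* code, size_t bytes) {
    static const uint8_t kNop4[4] = {0x0f, 0x1f, 0x40, 0x00};  // nop [eax+0x0]
    size_t i = 0;
    for (; i + 4 <= bytes; i += 4)
      memcpy(code + i, kNop4, sizeof(kNop4));
    while (i < bytes)
      code[i++] = 0x90;  // single-byte NOP for any remainder
  }

  // EmitNopPadding(stubCode, 32) would emit eight 4-byte NOPs.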
Attachment #8805497 - Flags: review?(jdemooij)
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Review of attachment 8805497 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, let's see if this works :)
Attachment #8805497 - Flags: review?(jdemooij) → review+
Pushed by hv1989@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d77c9965e9b2
Try to work around mysterious AMD crashes (take 2). r=jandem
https://hg.mozilla.org/mozilla-central/rev/d77c9965e9b2
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
See Also: → 1312270
See Also: → 1176268
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Approval Request Comment

[Feature/regressing bug #]:
/

[User impact if declined]:
This contains a possible fix for AMD Bobcat processors. We are not sure this is a definite fix; crash volumes are only high enough on beta to let us conclude whether it works.

[Describe test coverage new/current, TreeHerder]:
Has been on nightly since last week. I also enabled this workaround on my own computer and this caused no problems.

[Risks and why]: 
We are improving a workaround for AMD Bobcat processors only. We already had a workaround uplifted, but that didn't seem to fix the crashes; hopefully this new patch does. We increased the NOP size from 4 bytes to 32 bytes at specific positions. We don't expect any regressions or issues even if the workaround doesn't work.

[String/UUID change made/needed]:
/
Attachment #8805497 - Flags: approval-mozilla-beta?
Attachment #8805497 - Flags: approval-mozilla-aurora?
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

Let's stabilize this on Aurora51 before deciding whether to uplift to Beta50.
Attachment #8805497 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Hello Jan, Hannes, I don't think this is a release blocker. We are in code freeze mode and only taking fixes for critical new regressions, crashes, security issues. I would like to wontfix this for 50 unless you can provide me with data that this is really valuable based on Nightly52/Aurora51 data. Thanks!
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
Attachment #8805497 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Hello DBaron, this is a late uplift in 50 RC. Do you think this is something worth taking a risk on and uplifting to 50 RC2? Or are we better off letting this stabilize on Beta51 (after merge day) for a few weeks and uplifting to planned 50.1.0 release? Appreciate your help.
Flags: needinfo?(dbaron)
The most important thing is to get this on Beta as soon as possible, since we can only decide whether this workaround fixed the issue once it hits Beta. We don't have enough people with this combination of hardware to meaningfully trigger this on aurora/nightly.

As soon as this is on Beta we have to wait 2-3 weeks before we have insightful information about it.

Now the most important part is getting information to drive the next decision.
1) This fixes the crash. In that case we should try to get this in a stable X.1 release.
2) This doesn't fix the crash. We will go back to drawing board to find another workaround.

Currently it doesn't make sense for us to force an uplift for a 50.1.0 release without having crash information on Beta.
As a result we should either take this on Beta50 or wait 2 weeks and automatically have this on Beta51.
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
As a side-note:

The workaround only triggers for "AMD Ontario" processors and is very benign. As a result, even if the workaround doesn't work, we will not back this patch out; we'll just let its removal ride the trains.
I don't have a strong opinion here.  I'm generally pretty conservative on uplifts to RC, but if the patch is believed to be very safe I think it's worth considering, given that this was a significant number of crashes.
Flags: needinfo?(dbaron)
Depends on: 1315196
Comment on attachment 8805497 [details] [diff] [review]
Increasing NOP size

I have given it some thought. I will take a leap of faith and take this uplift in 50 RC2. If things look worse, we back it out and build RC3. If things look same/better, we ship with it. If things look worse on release, we do a dot release with a better fix or back out this one. Let's hope this one sticks. Beta50+
Attachment #8805497 - Flags: approval-mozilla-beta- → approval-mozilla-beta+
See Also: → 1316022
hi, rc2 still shows a particular crash pattern with some js related signatures (bug 1316022) that are correlated to the amd cpus: http://bit.ly/2fduzlh
Flags: needinfo?(jdemooij)
(In reply to [:philipp] from comment #32)
> hi, rc2 still shows a particular crash pattern with some js related
> signatures (bug 1316022) that are correlated to the amd cpus:
> http://bit.ly/2fduzlh

This bug is about the code generated by SpiderMonkey JIT compilers (as written in the bug title).

Bug 1316022 is dealing with stack traces of code generated by MSVC.  If you got a report with no stack trace, then it might be related to this bug.
Flags: needinfo?(jdemooij)
(In reply to Nicolas B. Pierron [:nbp] from comment #33)
> (In reply to [:philipp] from comment #32)
> > hi, rc2 still shows a particular crash pattern with some js related
> > signatures (bug 1316022) that are correlated to the amd cpus:
> > http://bit.ly/2fduzlh
> 
> This bug is about the code generated by SpiderMonkey JIT compilers (as
> written in the bug title).
> 
> Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you
> got a report with no stack s trace then this might be related to this bug.

What do you mean by 'report with no stack trace'? Do you mean stack traces that contain code generated by the JIT?

If that's the case, we might have a few reports: https://crash-stats.mozilla.com/search/?build_id=20161104212021&signature=%3Djs%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3Djs%3A%3ANativeObject%3A%3AputProperty&signature=%3DAddOrChangeProperty&signature=%3DnsCOMPtr_base%3A%3A~nsCOMPtr_base%20%7C%20mozilla%3A%3Adom%3A%3AWriteOp%3A%3AInit&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AaddPropertyInternal&signature=%3D%400x0%20%7C%20js%3A%3ANativeObject%3A%3AputProperty&signature=%3Djs%3A%3Ajit%3A%3ACallInfo%3A%3AsetImplicitlyUsedUnchecked&signature=~EnterIon&proto_signature=~EnterBaseline&proto_signature=~EnterBaselineMethod&product=Firefox&process_type=browser&date=%3E%3D2016-11-05T00%3A00%3A00.000Z&date=%3C2016-11-09T13%3A45%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_info&_facets=proto_signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-proto_signature

I've limited the search to stack traces that contain EnterBaseline, EnterBaselineMethod or EnterIon.

Could you check if one of those fits the conditions?
(In reply to Marco Castelluccio [:marco] from comment #34)
> (In reply to Nicolas B. Pierron [:nbp] from comment #33)
> > (In reply to [:philipp] from comment #32)
> > > hi, rc2 still shows a particular crash pattern with some js related
> > > signatures (bug 1316022) that are correlated to the amd cpus:
> > > http://bit.ly/2fduzlh
> > 
> > This bug is about the code generated by SpiderMonkey JIT compilers (as
> > written in the bug title).
> > 
> > Bug 1316022 is dealing with stack trace of code generated by MSVC.  If you
> > got a report with no stack s trace then this might be related to this bug.
> 
> What do you mean by 'report with no stack trace'? Do you mean stack traces
> that contain code generated by the JIT?

Yes, this means that the stack frames do not have many symbols, and the next valid frame should be something such as EnterIon / EnterBaseline.

> If that's the case, we might have a few reports: […]
> 
> I've limited the search to stack traces that contain EnterBaseline,
> EnterBaselineMethod or EnterIon.
> 
> Could you check if one of those fits the conditions?

I will try to explain my reasoning with one example:

  @0x0                                    ,  This seems to appear when some valid stacks exist
  js::NativeObject::addPropertyInternal   |  below them.  Thus these might be called by the JIT but
  xpc::JSXrayTraits::getPrototype         '  not caused by it.
  js::Shape::search<T>                    
  js::AddTypePropertyId                   
  AddOrChangeProperty                     
  js::SetPropertyByDefining               
  mozglue.dll@0x44af                      ,
  @0xffffff8b                             |  These stack frames are in complete random-ish
  js::jit::SetPropertyIC::update          |  order, and this might not correspond
  @0x2e70d7                               |  to any valid call path which can be produced
  @0xffffff84                             |  by either our code-base, or by code generated
  intrinsic_OwnPropertyKeys               |  by the JIT compilers.
  @0x75d212f                              |
  @0x2516585f                             |  The most likely source of this non-sense is
  @0x168db13                              |  the remains of previous stack frames pushed
  @0x24b676ff                             |  on the stack.
  @0x7310942                              '
  EnterBaseline                           ,
  js::jit::EnterBaselineMethod            |
  Interpret                               |
  js::RunScript                           | Valid call stack for entering JIT code.
  js::InternalCallOrConstruct             |
  js::jit::InvokeFunction                 |
  js::jit::TryAttachCallStub              '

Thus, in my opinion, this precise stack trace does not seem to point to a bug in our JIT compiler.  If we were looking for a bug in our JIT compiler, I would expect the stack frames to look something like:

  @0x75d212f                              ,
  js::jit::SetPropertyIC::update          |  Random mess from the stack, even if some of the mess
  @0x168db13                              |  can be attributed a valid name.
  @0x24b676ff                             |
  @0x7310942                              '
  EnterBaseline                           ,
  js::jit::EnterBaselineMethod            |
  Interpret                               |
  js::RunScript                           | Valid call stack for entering JIT code.
  js::InternalCallOrConstruct             |
  js::jit::InvokeFunction                 |
  js::jit::TryAttachCallStub              '
Some of the reports with signature "jit | CORRUPT_CODE" on 50.1.0 seem to be failing in JIT compiled code, e.g. bp-a884f584-e3df-4167-84e9-4adf82161220:
@0x71abc33 	
@0x75087de 	
@0x406ecdf7 	
@0x3ce20942 	
EnterBaseline
js::jit::EnterBaselineMethod(JSContext*, js::RunState&)
js::RunScript(JSContext*, js::RunState&)
js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)
InternalCall
@0xffffff8b

(We have a spike of crashes related to the usual AMD CPUs in 50.1.0, some of them aren't happening in JIT code, but some of them might be. The bug is bug 1287087).
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(hv1989)
Alias: amdbug
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Well, well, well. Look what we have here. 
AMD's family 20, 14h, meaning Bobcat CPUs.

50€ says this is about what I extensively reported in bug 1335925, comment 1. 
You are going to have a pretty bad time trying to 1. work around the issue and 2. detect when to apply the workaround in the first place. 
Link again to the (likely) problem: http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=63

For starters, I'm not sure how you'd manage to avoid a vague "set of internal timing conditions", if not by blindly pushing patches over and over again, waiting for the wheel of luck to spin. 
Secondly, don't expect microcode to mean anything at all. 
This is about something else getting updated in the BIOS. AGESA maybe. Anyway, something that to the best of my knowledge is hard to figure out even after disassembling the BIOS. 

The proper way to check would be to read MSRs, but that's really not doable (bug 772330, comment 25). 
A manual blacklist of 'crashy' unpatched bios revisions, if any, is the only thing that I see could *remotely* do something. 
(even though that'd still be suboptimal for cases where you have a CPU with the bug fixed in silicon, but also a bug-unpatched BIOS)

Also worth mentioning that Linux got a fix baked into the OS. 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfc1168de949cd3e9ca18c3480b5085deff1ea7c
If you use anything newer than 4.14-rc7, you should be 100% safe. 

On Windows it seems like AMD itself used to take care of implementing similar solutions (AMDeFix in their chipset drivers), which I think could even be pushed by Microsoft universally... But for the moment, no luck at all. 

Regards.
Blocks: 1335925
Thanks for the info! I've got my hands on some machines with the chip now. I wonder if we should consider disabling JITs on this processor so at least the browser doesn't crash all day =\
Afaik (bug 1298333) :marco should also have one. 
Or rather, the APU family should be the right one. But god knows if it's one of those fixed batches or not. 

It shouldn't be all that hard to manually check MSRs on linux though. 
Or run my test case in bug 1335925 under windows.
(In reply to Ted Campbell [:tcampbell] from comment #40)
> Thanks for the info! I've got my hands on some machines with the chip now. I
> wonder if we should consider disabling JITs on this processor so at least
> the browser doesn't crash all day =\

Ted: How common are these buggy AMD processors? Do these crashes only affect 64-bit Firefox? IIUC, EnterBaseline is a pretty generic crash signature so I don't know how to filter crash reports for this specific AMD issue.

jandem: Should we disable JITs for these AMD users? How much will that hurt performance?
Flags: needinfo?(jdemooij)
Afaik they are so common that you ended up blacklisting the *whole* set of AMD drivers, just in an attempt to work around this bug, nearly ten times over the years. 
Sometimes even nailing down the issue (can't give enough credit to dmajor), but basically always forgetting or passing over further specific actions. 

If you check bug 1335925, you can see my test case was actually done with FF19, so it's by no means 64-bit only. 
Just, as I was saying, the conditions to trigger the bug are pretty much a lottery. 

Isn't there some amd engineer around the bugzilla that may have further background?
I'm not familiar with graphics drivers or the blacklist process, so I'll leave that discussion to the other bug.

Scope of this bug is reducing the EnterBaseline crash signature. We are currently seeing this on Firefox 57 64-bit quite a bit on these CPUs. We've applied mitigations in the past on the advice of AMD engineers, but it seems the combination of codegen tweaks and our 64-bit build in 57 has made this appear in larger numbers than previously seen.
Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on Firefox 57 before it crashed =\
Forwarding to Ted since he's working on this.
Flags: needinfo?(jdemooij) → needinfo?(tcampbell)
(In reply to Ted Campbell [:tcampbell] from comment #45)
> Installed Windows 10 64-bit on a Lenovo X130e and made it 10 seconds on
> Firefox 57 before it crashed =\

Could not reproduce on W7 on my K53U with unpatched bios.
It seems it was just luck it crashed so fast. The two machines were able to stay up for the hour I observed them. Will enable full crash dumps today and somehow script them to start/stop Firefox and load a page. I'd like to get a dump of what path is entering the TypeMonitor stub and determine if adding the NOPs to that end improves stability enough.

Also, as suggested by :jandem on IRC, we shouldn't have to go so far as disabling full JIT, but rather disable key IC stubs on affected machines.
Flags: needinfo?(tcampbell)
Looking at crash stats, these crashes are happening on up-to-date Windows 10 / 8.1 systems with the Catalyst video driver from Windows Update and CPU microcode updates applied.
No Windows 7?

Also, putting aside that at least the provided W10 driver was quite a letdown last year, that makes sense. 
If my digging through coreboot is right, the fix for erratum 688 passes through an AGESA update, which in turn requires a totally new BIOS release. 
Or at least I see this correlation in the newer fixed one for my laptop. 

I wonder if said AGESA release number couldn't be read from userspace?
There are Windows 7 crashes too.

I'm having trouble reproducing still. Based on mirh's observations that old microcode was really unstable, I've disabled Windows microcode updates on one of these and am still not hitting it. The AGESA update thing is interesting. I'll see if I can find any info on what these laptops have.

Current plan is still to try and capture a full process crash dump and investigate the JIT code paths preceding the crash. Mitigation will likely be adding more of the no-ops or disabling the unlucky JIT stubs in order to reduce the crash rate to something more acceptable.
(In reply to Ted Campbell [:tcampbell] from comment #51)
> There are Windows 7 crashes too.

Oh, ok, I was almost panicking at the thought of the contrary. 

That said, again, for the fourth time: microcode means *nothing*. 
0x05000029 (the most updated one), yes, does something, but it really fixes all but the erratum we are interested in. 
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/microcode_amd.bin.README#n26

The AGESA version can usually be easily checked by opening the BIOS image in a hex editor and looking for the "OntaroPI" string. 
Or I think even AIDA64 should be able to detect it, though the magic involved escapes me. 

At least on my laptop, broken BIOSes have v1.1.0.0, while fixed ones are 1.2.0.0, so that was the educated guess I made. 
The last word on whether you are affected by the bug, though (according to the docs), would be running these under Linux:

sudo setpci -s 18.4 0x164.l

sudo modprobe msr
sudo rdmsr --all 0xc0011021

Then check whether those commands return 00000003 and 10008000 respectively. 

Sad that reproducing it on 57 is so difficult.
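For anyone without rdmsr handy, a small sketch of what that command does on Linux: read 8 bytes from the msr device at the MSR address (requires root and the msr module loaded; MSR C001_1021 is the register from the commands above, everything else here is illustrative):

  #include <cstdint>
  #include <cstdio>
  #include <fcntl.h>
  #include <unistd.h>

  int main() {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);  // per-CPU MSR device node
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    uint64_t value;
    // The MSR number is passed as the read offset.
    if (pread(fd, &value, sizeof(value), (off_t)0xc0011021) != sizeof(value)) {
      perror("read MSR C001_1021");
      close(fd);
      return 1;
    }
    printf("MSR C001_1021 = %016llx\n", (unsigned long long)value);
    close(fd);
    return 0;
  }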
(In reply to mirh from comment #52)
> That said, again, for the fourth time: microcode means *nothing*. 
> 0x05000029 (the most updated one), yes, does something, but it really fixes
> all but the erratum we are interested in. 
> https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/tree/
> microcode_amd.bin.README#n26

Ah, nice link. I now follow what you were saying.


> The AGESA version can usually be easily checked by opening the BIOS image
> in a hex editor and looking for the "OntaroPI" string. 
> Or I think even AIDA64 should be able to detect it, though the magic
> involved escapes me. 

AGESA 1.1.0.0 is what both factory and "updated" firmware images show for these X130e machines.


> sudo setpci -s 18.4 0x164.l
> 
> sudo modprobe msr
> sudo rdmsr --all 0xc0011021
> 
> Then check whether those commands return 00000003 and 10008000
> respectively.

I get 00000007 and 10008000 on this one laptop, indicating it should already be fixed. I'll check the other laptop on Monday. (I don't have my notes here on which one triggered the bug.)
(In reply to Ted Campbell [:tcampbell] from comment #53)
> I get 00000007 and 10008000 on this one laptop, indicating it should already
> be fixed. I'll check the other laptop on Monday. (I don't have my notes
> here on which one triggered the bug.)

Loool. No **** then you weren't experiencing the issue!!
I must correct myself though: 
in addition to 00000003, even 00000001 signals the bug. 
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=9
[Tracking Requested - why for this release]:

This 64-bit EnterBaseline JIT crash is 57.0 content process top crash #2.
Ted, Release Management plans to ship a 57.0.1 dot release this week. Do you think we should temporarily disable the JIT for users on these AMD CPUs in 57.0.1? You mentioned on Slack that some people are hitting this crash 10+ times a day.
Flags: needinfo?(tcampbell)
Ted says we shouldn't rush a fix for 57.0.1. This is sensitive code and, while we are receiving many crash reports, the actual number of affected users is relatively small. On Beta 58, for example, we have about 5000 reports from just 150 installs.
Flags: needinfo?(tcampbell)
It is looking like only a fraction of the users of this processor are hitting the problem badly. We don't have a reliable test for identifying which ones are affected, so it is unwise to risk hurting the performance of everyone using this processor. People who are affected should use 32-bit builds.

The second sample laptop I have does report (via privileged mode tests that mirh described above) as having bad silicon. I'm continuing to try and reproduce the issue more than once.

We should continue to track this issue, but fixes should be attempted in beta first.
[Tracking Requested - why for this release]:
Actually, rates in beta are much lower than I thought. 4000 of the 4700 crashes in the last week are from a single user hitting a crash-reporter issue that caused it to submit the same crash over and over again. After correcting for that, the AMD bug is still the cause of 100 crashes a day on beta 58. These AMD crashes are still almost all 64-bit. We should still monitor this bug, but it isn't really the #2 crash on beta.
Depends on: 1421445
Not sure we should rush anything across trains here.
Assignee: jdemooij → tcampbell
Assignee: tcampbell → nobody
Blocks: SadJit

The TypeMonitor IC stubs have been a large source of where these crashes appear. Their ubiquity and specific control-flow pattern seem to really exacerbate the CPU defect.

With Bug 1613592, we have disabled the use of these ICs, and with Bug 1673553 we have removed the code altogether. As a result, we see a significant (10x) reduction in these JIT crashes on AMD Bobcat machines starting with FF83. These crashes were about a third of JIT crashes across all desktop CPUs, so this will be apparent in those stats as well.

https://crash-stats.mozilla.org/signature/?release_channel=release&product=Firefox&cpu_info=~family%2020&signature=js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2020-09-07T13%3A49%3A00.000Z&date=%3C2020-12-07T13%3A49%3A00.000Z#graphs

Closing given comment 62.

Status: REOPENED → RESOLVED
Closed: 8 years ago → 3 years ago
Resolution: --- → FIXED
Assignee: nobody → jdemooij