Closed Bug 1421445 Opened 2 years ago Closed 2 years ago

AMD Bobcat 64-bit crash mitigation


(Core :: JavaScript Engine: JIT, defect, P3)




Tracking Status
firefox57 --- wontfix
firefox58 --- fixed
firefox59 --- fixed


(Reporter: tcampbell, Assigned: tcampbell)


(Blocks 1 open bug)



(3 files, 4 obsolete files)

As an attempt to reduce crashes in Bug 1281759, I propose adding NOPs before JIT code buffers. Looking at the 64-bit crashes on FF57 / FF58, the reported fault instruction is a few bytes before the code starts in the middle of the JitCode* pointer.

This patch observes that we over-allocate those buffers by sizeof(void*) due to incorrect alignment math. By stealing that data to fill with NOPs we remain size neutral and need fewer checks if CPU is affected. In the future, if this fix does not provide stability improvements, we remove the NOPs and have a size reduction.

Looking at crash data, it seems that most of the WRITE faults are crashing writing to address of %rax, which is consistent with trying to execute the JitCode* pointer as code. Currently, 75% of the crashes we see for this 64-bit FF57 AMD Bobcat signature are WRITE faults.
Add JitCodeHeader struct and fix allocation size computation to save one word per jit code buffer.
Add NOP-slide before jit code buffers. This is size-neutral when including previous patch. As a result, I don't try and be too clever about only applying this to affected machines.
Assignee: nobody → tcampbell
Attachment #8932634 - Flags: feedback?(jdemooij)
Comment on attachment 8932634 [details] [diff] [review]

Review of attachment 8932634 [details] [diff] [review]:

Makes sense.

::: js/src/jit/Ion.cpp
@@ +739,5 @@
>      return trampolineCode(p->value());
>  }
> +void
> +JitCodeHeader::init(JitCode* jitCode) {

Nit: { on its own line.
Attachment #8932634 - Flags: feedback?(jdemooij) → feedback+
Cleaned up assert to check end of buffer is within range.
Attachment #8932633 - Attachment is obsolete: true
Attachment #8932919 - Flags: review?(jdemooij)
Fixed parens/whitespace.
Attachment #8932634 - Attachment is obsolete: true
Attachment #8932920 - Flags: review?(jdemooij)
Oops.. s/code/codeStart in the macro.
Attachment #8932919 - Attachment is obsolete: true
Attachment #8932919 - Flags: review?(jdemooij)
Attachment #8932923 - Flags: review?(jdemooij)
Comment on attachment 8932923 [details] [diff] [review]

Review of attachment 8932923 [details] [diff] [review]:

::: js/src/jit/Linker.cpp
@@ +51,4 @@
>      // Bump the code up to a nice alignment.
>      codeStart = (uint8_t*)AlignBytes((uintptr_t)codeStart, CodeAlignment);
> +    MOZ_ASSERT(codeStart + masm.bytesNeeded() <= result + bytesNeeded);

Good to have this assert.
Attachment #8932923 - Flags: review?(jdemooij) → review+
Attachment #8932920 - Flags: review?(jdemooij) → review+
Try run from the final patches (with s/code/codeStart/ fix):
Pushed by
Don't waste space allocating jit code buffer. r=jandem
Add NOP-slide before JIT code buffers on AMD Bobcat. r=jandem
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Reopening bug because the patches were backed out on inbound. The patches have been merged to central, but the backout hasn't yet.
Resolution: FIXED → ---
Sorry about that and thanks for already reopening.
Target Milestone: mozilla59 → ---
Priority: -- → P3
The Valgrind failure is that we pack instructions to close to the following JitCodeHeader data leading to decoding garbage and walking off the page and SEGV-ing Valgrind itself.

This patch adds a word of padding at end of JitCode buffer, but only on MOZ_VALGRIND.

Jan, does this seem too questionable to land? I'll follow up on the valgrind side to see if it should be fixed.
Flags: needinfo?(tcampbell)
Attachment #8935645 - Flags: review?(jdemooij)
Another point that :pbone made is to check the Intel guides on recommendations for padding between code and data because they recommend not having garbage nearby.
Uhm, btw, could you add a check for linux kernel <4.14, alongside cpuid check, in NeedAmdBugWorkaround function?
For reference, this is the kernel patch that mirh is referring to:

Also for reference, this is what :pbone is talking about in Intel guides:

> Assembly/Compiler Coding Rule 57. (M impact, L generality) If (hopefully read-only) data must
> occur on the same page as code, avoid placing it immediately after an indirect jump. For example,
> follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.

After some discussion with :sewardj, it seems the Valgrind failure deserves further review and the assessment in Comment 15 is not correct. Valgrind shouldn't be looking any further than the basic block.
Comment on attachment 8935645 [details] [diff] [review]

This patch is not the right approach. I have an alternate that only puts the pad at the end of ExecutableAllocator pool that passes valgrind.
Attachment #8935645 - Attachment is obsolete: true
Attachment #8935645 - Flags: review?(jdemooij)
This patch add's UD2 ops at the end of JitCode to avoid perf impact of indirect branch followed by data bytes. This also mitigates the Valgrind crash.
Attachment #8936677 - Flags: review?(jdemooij)
Comment on attachment 8936677 [details] [diff] [review]

Review of attachment 8936677 [details] [diff] [review]:

Attachment #8936677 - Flags: review?(jdemooij) → review+
Pushed by
Don't waste space allocating jit code buffer. r=jandem
Add NOP-slide before JIT code buffers on AMD Bobcat. r=jandem
Insert UD2 instructions after JitCode. r=jandem
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Is that something you would like to uplift in 58?
Flags: needinfo?(tcampbell)
Comment on attachment 8932923 [details] [diff] [review]

Approval Request Comment
[Feature/Bug causing the regression]:
AMD Bobcat CPU bug
[User impact if declined]:
This attempts to mitigate some of the AMD Bobcat crash signatures which are the top 64-bit crash on 57 and 58.
[Is this code covered by automated tests?]:
Automated tests cover regressions on normal CPUs. There is no automated test for the specific CPU fault.
[Has the fix been verified in Nightly?]:
Nightly has not regressed with these patches.
[Needs manual test from QE? If yes, steps to reproduce]:
No. We will watch the crash stats after landing.
[List of other uplifts needed for the feature/fix]:
All three patches on this bug are needed.
[Is the change risky?]:
[Why is the change risky/not risky?]:
We adjust memory layout of a key structure, which is a risk. This code is heavily covered by many existing test suites so if nightly is stable, I don't expect problems in uplift.
[String changes made/needed]:
Flags: needinfo?(tcampbell)
Attachment #8932923 - Flags: approval-mozilla-beta?
Comment on attachment 8932923 [details] [diff] [review]

Sounds promising to fix a top crash in 58, let's uplift for the 58.0b12 build next week.
Attachment #8932923 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1425413
Ted, do you expect that we will see any change in the AMD crash volume from Nightly users if this fix is working? Or are there too few AMD Bobcat users running Nightly?

In the four days since this fix landed in Nightly 59 (2017-12-14), there have been 103 EnterJit and 16 EnterBaseline crashes from Win64 Firefox users. In the four days preceding the fix, there were 104 EnterJit and zero EnterBaseline crashes.
Flags: needinfo?(tcampbell)
Ted and I chatted on Slack. He said the mitigation didn't fix the AMD crashes. It just made the crashes weirder, such as impossible faults in register move instructions. He doesn't think we should back out the code change. It cleans up the JIT code and it shouldn't make these AMD crashes worse.
Flags: needinfo?(tcampbell)
Any update on this?
Might KAISER/Spectre/Meltdown workarounds-mitigation you or OS implement help (or worsen?) the issue?

Also before we all forget again, adding that little check I suggest in comment 18 wouldn't hurt.
You need to log in before you can comment on or make changes to this bug.