Closed Bug 1421445 Opened 2 years ago Closed 2 years ago

AMD Bobcat 64-bit crash mitigation

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox57 --- wontfix
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: tcampbell, Assigned: tcampbell)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 4 obsolete files)

As an attempt to reduce crashes in Bug 1281759, I propose adding NOPs before JIT code buffers. Looking at the 64-bit crashes on FF57 / FF58, the reported fault instruction is a few bytes before the code starts in the middle of the JitCode* pointer.

This patch observes that we over-allocate those buffers by sizeof(void*) due to incorrect alignment math. By stealing that data to fill with NOPs we remain size neutral and need fewer checks if CPU is affected. In the future, if this fix does not provide stability improvements, we remove the NOPs and have a size reduction.

Looking at crash data, it seems that most of the WRITE faults are crashing writing to address of %rax, which is consistent with trying to execute the JitCode* pointer as code. Currently, 75% of the crashes we see for this 64-bit FF57 AMD Bobcat signature are WRITE faults.
Add JitCodeHeader struct and fix allocation size computation to save one word per jit code buffer.
Add NOP-slide before jit code buffers. This is size-neutral when including previous patch. As a result, I don't try and be too clever about only applying this to affected machines.
Assignee: nobody → tcampbell
Status: NEW → ASSIGNED
Attachment #8932634 - Flags: feedback?(jdemooij)
Comment on attachment 8932634 [details] [diff] [review]
0002-Bug-1421445-Add-NOP-slide-before-JIT-code-buffers-on.patch

Review of attachment 8932634 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense.

::: js/src/jit/Ion.cpp
@@ +739,5 @@
>      return trampolineCode(p->value());
>  }
>  
> +void
> +JitCodeHeader::init(JitCode* jitCode) {

Nit: { on its own line.
Attachment #8932634 - Flags: feedback?(jdemooij) → feedback+
Cleaned up assert to check end of buffer is within range.
Attachment #8932633 - Attachment is obsolete: true
Attachment #8932919 - Flags: review?(jdemooij)
Fixed parens/whitespace.
Attachment #8932634 - Attachment is obsolete: true
Attachment #8932920 - Flags: review?(jdemooij)
Oops.. s/code/codeStart in the macro.
Attachment #8932919 - Attachment is obsolete: true
Attachment #8932919 - Flags: review?(jdemooij)
Attachment #8932923 - Flags: review?(jdemooij)
Comment on attachment 8932923 [details] [diff] [review]
0001-Bug-1421445-Don-t-waste-space-allocating-jit-code-bu.patch

Review of attachment 8932923 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/Linker.cpp
@@ +51,4 @@
>  
>      // Bump the code up to a nice alignment.
>      codeStart = (uint8_t*)AlignBytes((uintptr_t)codeStart, CodeAlignment);
> +    MOZ_ASSERT(codeStart + masm.bytesNeeded() <= result + bytesNeeded);

Good to have this assert.
Attachment #8932923 - Flags: review?(jdemooij) → review+
Attachment #8932920 - Flags: review?(jdemooij) → review+
Try run from the final patches (with s/code/codeStart/ fix): https://treeherder.mozilla.org/#/jobs?repo=try&revision=67a3d0645a7fa634429fcc95bf51b6c9fc3c316f
Pushed by tcampbell@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/797d9e71b648
Don't waste space allocating jit code buffer. r=jandem
https://hg.mozilla.org/integration/mozilla-inbound/rev/39f221c258fc
Add NOP-slide before JIT code buffers on AMD Bobcat. r=jandem
https://hg.mozilla.org/mozilla-central/rev/797d9e71b648
https://hg.mozilla.org/mozilla-central/rev/39f221c258fc
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Reopening bug because the patches were backed out on inbound. The patches have been merged to central, but the backout hasn't yet.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sorry about that and thanks for already reopening.

https://hg.mozilla.org/mozilla-central/rev/33826eb2a216
Target Milestone: mozilla59 → ---
Priority: -- → P3
The Valgrind failure is that we pack instructions to close to the following JitCodeHeader data leading to decoding garbage and walking off the page and SEGV-ing Valgrind itself.

This patch adds a word of padding at end of JitCode buffer, but only on MOZ_VALGRIND.

Jan, does this seem too questionable to land? I'll follow up on the valgrind side to see if it should be fixed.
Flags: needinfo?(tcampbell)
Attachment #8935645 - Flags: review?(jdemooij)
Another point that :pbone made is to check the Intel guides on recommendations for padding between code and data because they recommend not having garbage nearby.
Uhm, btw, could you add a check for linux kernel <4.14, alongside cpuid check, in NeedAmdBugWorkaround function?
For reference, this is the kernel patch that mirh is referring to: https://github.com/torvalds/linux/commit/bfc1168de949cd3e9ca18c3480b5085deff1ea7c


Also for reference, this is what :pbone is talking about in Intel guides:

> Assembly/Compiler Coding Rule 57. (M impact, L generality) If (hopefully read-only) data must
> occur on the same page as code, avoid placing it immediately after an indirect jump. For example,
> follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.


After some discussion with :sewardj, it seems the Valgrind failure deserves further review and the assessment in Comment 15 is not correct. Valgrind shouldn't be looking any further than the basic block.
Comment on attachment 8935645 [details] [diff] [review]
0001-Bug-1421445-Work-around-Valgrind-bug-concerning-JitC.patch

This patch is not the right approach. I have an alternate that only puts the pad at the end of ExecutableAllocator pool that passes valgrind.
Attachment #8935645 - Attachment is obsolete: true
Attachment #8935645 - Flags: review?(jdemooij)
This patch add's UD2 ops at the end of JitCode to avoid perf impact of indirect branch followed by data bytes. This also mitigates the Valgrind crash.
Attachment #8936677 - Flags: review?(jdemooij)
Comment on attachment 8936677 [details] [diff] [review]
0003-Bug-1421445-Insert-UD2-instructions-after-JitCode.patch

Review of attachment 8936677 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent.
Attachment #8936677 - Flags: review?(jdemooij) → review+
Pushed by tcampbell@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a031f41d0fb9
Don't waste space allocating jit code buffer. r=jandem
https://hg.mozilla.org/integration/mozilla-inbound/rev/90a8cf71fa46
Add NOP-slide before JIT code buffers on AMD Bobcat. r=jandem
https://hg.mozilla.org/integration/mozilla-inbound/rev/af47243ec0d7
Insert UD2 instructions after JitCode. r=jandem
https://hg.mozilla.org/mozilla-central/rev/a031f41d0fb9
https://hg.mozilla.org/mozilla-central/rev/90a8cf71fa46
https://hg.mozilla.org/mozilla-central/rev/af47243ec0d7
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Is that something you would like to uplift in 58?
Flags: needinfo?(tcampbell)
Comment on attachment 8932923 [details] [diff] [review]
0001-Bug-1421445-Don-t-waste-space-allocating-jit-code-bu.patch

Approval Request Comment
[Feature/Bug causing the regression]:
AMD Bobcat CPU bug
[User impact if declined]:
This attempts to mitigate some of the AMD Bobcat crash signatures which are the top 64-bit crash on 57 and 58.
[Is this code covered by automated tests?]:
Automated tests cover regressions on normal CPUs. There is no automated test for the specific CPU fault.
[Has the fix been verified in Nightly?]:
Nightly has not regressed with these patches.
[Needs manual test from QE? If yes, steps to reproduce]:
No. We will watch the crash stats after landing.
[List of other uplifts needed for the feature/fix]:
All three patches on this bug are needed.
[Is the change risky?]:
Low-moderate
[Why is the change risky/not risky?]:
We adjust memory layout of a key structure, which is a risk. This code is heavily covered by many existing test suites so if nightly is stable, I don't expect problems in uplift.
[String changes made/needed]:
None
Flags: needinfo?(tcampbell)
Attachment #8932923 - Flags: approval-mozilla-beta?
Comment on attachment 8932923 [details] [diff] [review]
0001-Bug-1421445-Don-t-waste-space-allocating-jit-code-bu.patch

Sounds promising to fix a top crash in 58, let's uplift for the 58.0b12 build next week.
Attachment #8932923 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Depends on: 1425413
Ted, do you expect that we will see any change in the AMD crash volume from Nightly users if this fix is working? Or are there too few AMD Bobcat users running Nightly?

In the four days since this fix landed in Nightly 59 (2017-12-14), there have been 103 EnterJit and 16 EnterBaseline crashes from Win64 Firefox users. In the four days preceding the fix, there were 104 EnterJit and zero EnterBaseline crashes.
Flags: needinfo?(tcampbell)
Ted and I chatted on Slack. He said the mitigation didn't fix the AMD crashes. It just made the crashes weirder, such as impossible faults in register move instructions. He doesn't think we should back out the code change. It cleans up the JIT code and it shouldn't make these AMD crashes worse.
Flags: needinfo?(tcampbell)
Any update on this?
Might KAISER/Spectre/Meltdown workarounds-mitigation you or OS implement help (or worsen?) the issue?

Also before we all forget again, adding that little check I suggest in comment 18 wouldn't hurt.
You need to log in before you can comment on or make changes to this bug.