Closed Bug 1461427 Opened 6 years ago Closed 5 years ago

[meta] JIT Crash review - May 2018

Categories

(Core :: JavaScript Engine: JIT, enhancement, P1)

RESOLVED FIXED

People

(Reporter: tcampbell, Assigned: tcampbell)

Details

(Keywords: meta, sec-audit)

Bug 858032 is our meta bug for tracking crashes inside JIT code. Unfortunately, that is a broad grouping of crashes, many of which we've considered unactionable. Because of function renames and transient problems, we've had trouble finding actionable items there.

Here, I'll include my current analysis of crashes we are seeing for FF 60 - 62.
Crashes on these signatures have been quite stable over the last 3 months. I did a case study of FF 60.0b16, comparing with the work that Jan did in https://bugzilla.mozilla.org/show_bug.cgi?id=1034706#c44.

The following is a list of some of the crash patterns we see (the list is growing). Since this crash signature is so general, we have a lot of noise from crashes that appear to be hardware failures.

1) Crashes on AMD Bobcat CPUs
- This is tracked in Bug 1281759
- Roughly 30% of JIT crashes currently are on these machines
- These correspond to a few defective AMD CPUs which have given us a lot of problems over the years. One option we are considering here is disabling the most-affected IC. There is a serious perf concern that needs to be measured, and it is not clear that the perf cost is a good trade for stability.

2) Single bit flip in JIT code
- Jan observed these in a previous review
- A single bit is flipped (usually 1 -> 0) in the instruction stream and the bit error appears in the dump
- We lean toward believing these are DRAM failures, but we are keeping an eye out for other sources.
- Question: What is the window-of-opportunity between generating machine code and write-protecting the page?
- Some ways this crash manifests:
  - Spectre mitigations can emit two identical comparisons, but in the dump the values differ by one flipped bit. https://crash-stats.mozilla.org/report/index/ad9e0965-19da-40b9-b462-b49e50180429
  - Accessing ICEntry -> ICStub -> JitCodeRaw crashes on nullptr due to bit flip in baked-in pointer. https://crash-stats.mozilla.org/report/index/c087ae90-1119-4180-84fd-c5afd0180429
  - Crash on |00 00| due to a flipped bit throwing off instruction decode sequence. https://crash-stats.mozilla.org/report/index/da03ae4e-d597-4913-8a48-5a1030180430
  - Access violation due to wrong register due to bit flip in instruction encoding. https://crash-stats.mozilla.org/report/index/e394bf29-ac83-453b-b45d-46a850180428
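The "single bit flip" test described in 2) can be sketched as a byte-wise comparison between the machine code we expected to emit and the bytes found in the dump. This is only an illustration: the function name |classify_bit_flip| and the sample bytes in the usage note are hypothetical and not part of any existing crash tooling.

```python
def classify_bit_flip(expected: bytes, observed: bytes):
    """Return (byte_offset, bit_index, direction) when the two buffers
    differ by exactly one bit, otherwise None.

    direction is "1->0" or "0->1", read as expected -> observed.
    """
    if len(expected) != len(observed):
        raise ValueError("buffers must be the same length")

    # Collect (offset, xor mask) for every byte that differs.
    diffs = [(i, e ^ o)
             for i, (e, o) in enumerate(zip(expected, observed))
             if e != o]
    if len(diffs) != 1:
        return None  # identical, or more than one byte damaged

    offset, mask = diffs[0]
    if mask & (mask - 1):
        return None  # more than one bit differs within the byte

    bit = mask.bit_length() - 1
    direction = "1->0" if expected[offset] & mask else "0->1"
    return offset, bit, direction
```

For example, comparing the hypothetical bytes |48 39 c8| (expected) against |48 39 c0| (observed) reports a 1 -> 0 flip of bit 3 in the third byte, which is the kind of ModRM damage that would select the wrong register.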

3) Single bit flip (transient)
- Some crashes would be explained by a single bit flip, yet the minidump shows correct data
- These are almost certainly hardware failures

4) 16 bytes (aligned) of JIT code clobbered to zero
- https://crash-stats.mozilla.org/report/index/928547a7-5105-4bd6-a33b-20ada0180511
- I've mainly observed this at the start of JIT code. The |00 00| instruction stream (which decodes as |add [rax], al| on x86) generates a write violation where RAX is typically equal to RIP, since we used RAX to dispatch to the JIT code. The JitCodeHeader (at a negative offset) is valid, and later machine code is valid. The clobber is the same on 32-bit and 64-bit.
- This is suspicious!
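One way to look for the pattern in 4) across a set of dumps is to scan the captured code bytes for 16-byte-aligned blocks that are entirely zero. The sketch below assumes we have the raw bytes and the address they were loaded at; |zeroed_aligned_blocks| and |base_address| are hypothetical names, not an existing API.

```python
def zeroed_aligned_blocks(code: bytes, base_address: int, block: int = 16):
    """Return the absolute addresses of block-aligned chunks of `code`
    that are entirely zero. `base_address` is the address of code[0]."""
    found = []
    # Offset of the first block-aligned address at or after base_address.
    start = (-base_address) % block
    for off in range(start, len(code) - block + 1, block):
        if code[off:off + block] == bytes(block):
            found.append(base_address + off)
    return found
```

Alignment is computed on the absolute address rather than on the offset into the buffer, so a zero run that straddles two aligned blocks is not reported.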

5) 4-8 bytes of JIT code clobbered
- https://crash-stats.mozilla.org/report/index/60ec93d3-7e38-4a15-85a8-ef3850180430
- Usually to zero, but not always
- Often manifests as a write violation due to |00 00| instruction stream
- This is suspicious!

6) ARM64 Crashes with 1MB aligned address
- This is tracked in Bug 1461480
- 32-bit ARM and Desktop don't seem to have this pattern
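A quick way to check whether a set of crash addresses shows the 1MB alignment described in 6) is a mask test on the low 20 bits. This is a minimal sketch; the helper names are hypothetical and the addresses used with it would come from the crash reports.

```python
ONE_MIB = 1 << 20


def is_aligned(address: int, alignment: int = ONE_MIB) -> bool:
    """True when `address` is a multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return address & (alignment - 1) == 0


def aligned_fraction(addresses, alignment: int = ONE_MIB) -> float:
    """Fraction of the given addresses that are aligned; 0.0 for no input."""
    addresses = list(addresses)
    if not addresses:
        return 0.0
    return sum(is_aligned(a, alignment) for a in addresses) / len(addresses)
```

A fraction far above what random addresses would give (about one in a million for 1MB alignment) suggests the pattern is real rather than noise.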

7) Deref ObjectValue(nullptr) crash
- Jan observed this before
- This is what we would expect from a logic error
- The crash rate is quite low, and if it is a basic null deref then it is probably not scary.


I'm still investigating some of the rarer and scarier-looking crashes. Many of the cases above can manifest as wildptr/write/exec crashes and are hiding other, more interesting crashes. There are also hints of some UAF-looking crashes affecting Ryzen that need more investigation.
Assignee: nobody → tcampbell
Great analysis, thanks for looking into this!
> 5) 4-8 bytes of JIT code clobbered
- These crashes are aligned to the start of a 4kB page
8) Ryzen crashes
- https://crash-stats.mozilla.org/report/index/ccd54dbf-1eac-4e16-909d-ceb180180514
- The JIT crashes on Ryzen seem to be mostly bad hardware. This includes crashes on register moves and crashes on branches within the same page. The minidump shows correct data, so these seem like transient hardware failures.
- The working theory is these are badly installed heat-sinks and people overclocking new toys.

9) Sony phone on ARM
- Bug 1461724
- There is a high rate of Sony-specific crashes. There is likely a kernel config they use that is causing us trouble.
> 8) Ryzen crashes
- A reporter (see Bug 1453625) has been observing ASAN violations while fuzzing on Ryzen machines. We have not been able to reproduce those cases. I asked them to run a memtest and this was their feedback:
"I did it (there are no errors), the same issue was repeated in different machines. The only thing they share is similar HW (AMD Ryzen 7 1700X)."
- There might be something more subtle here.
Blocks: 1463654
Keywords: meta
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Group: javascript-core-security → core-security-release
Group: core-security-release
Summary: JIT Crash review - May 2018 → [meta] JIT Crash review - May 2018