Closed Bug 1833315 Opened 1 year ago Closed 1 year ago

Crash in [@ js::StackUses] in js::frontend::BytecodeSection::updateDepth(js::frontend::BytecodeOffset) (Samsung CPU Issue)

Categories

(Core :: JavaScript Engine, defect, P3)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox113 --- wontfix
firefox114 --- fixed
firefox115 --- fixed
firefox116 --- fixed

People

(Reporter: cpeterson, Assigned: mgaudet)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

This crash looks like a regression in Fenix 106 or a new crash signature from function inlining?

Crash report: https://crash-stats.mozilla.org/report/index/f585f1a9-0ed8-4dce-bc64-be8aa0230516

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  libxul.so  js::StackUses  js/src/vm/BytecodeUtil.h:333
0  libxul.so  js::frontend::BytecodeSection::updateDepth  js/src/frontend/BytecodeSection.cpp:176
0  libxul.so  js::frontend::BytecodeEmitter::emitN  js/src/frontend/BytecodeEmitter.cpp:369
0  libxul.so  js::frontend::BytecodeEmitter::emitEnvCoordOp  js/src/frontend/BytecodeEmitter.cpp:792
0  libxul.so  js::frontend::NameOpEmitter::emitGet  js/src/frontend/NameOpEmitter.cpp:99
0  libxul.so  js::frontend::BytecodeEmitter::emitGetNameAtLocation  js/src/frontend/BytecodeEmitter.cpp:1537
0  libxul.so  js::frontend::BytecodeEmitter::emitGetName  js/src/frontend/BytecodeEmitter.h:690
0  libxul.so  js::frontend::BytecodeEmitter::emitGetName  js/src/frontend/BytecodeEmitter.cpp:1547
0  libxul.so  js::frontend::BytecodeEmitter::emitTree  js/src/frontend/BytecodeEmitter.cpp:11692
1  libxul.so  js::frontend::BytecodeEmitter::emitPropLHS  js/src/frontend/BytecodeEmitter.cpp:1635

Matthew, I think you are already looking at this issue ;)

Blocks: sm-frontend
Severity: -- → S4
Flags: needinfo?(mgaudet)
Priority: -- → P3

This signature is covering three different crashes at least.

Looking at crash stats I'll note that of ~200 crashes, 95% are coming from Samsung mobiles. Almost all the samsung crashes have crash addresses ended in fa9 -- which feels important but I'm not sure what it means; they all appear to come to the same build IDs. Looking at the registers, there's a register with the crashing address (but aligned, ending in fa8). Unfortunately I can't see what instruction is being used to do the load, so I'm not sure if this is like 1-past the end of a buffer or if we're doing an unaligned load where we ought to not be.

There's also a series of crashes at 0x0000087c, all with the same install time (2023-05-09 17:49:58) -- these can be disregarded as bad hardware I suspect.

At least one of these crashes from windows is pretty suspicious. As much as the stack looks very real, it's reporting a MOZ_RELEASE_ASSERT(pn_type < ParseNodeKind::Limit), which is not used particularly nearby; it's unclear why this is the case.

Flags: needinfo?(mgaudet)

This doesn't feel like an S4 to me given the crash volume. This is currently the #4 overall content process crash on Fenix release.

Flags: needinfo?(nicolas.b.pierron)

I'd be interested in Iain's thoughts on these crash patterns.

Severity: S4 → --
Flags: needinfo?(iireland)

(Actually -- he's on PTO, nevermind. I'll take another look today)

Flags: needinfo?(iireland) → needinfo?(mgaudet)

Looking into the fa9 crashes, which are still 95% of the crashes under this signature.

This has however persisted across 3 builds.

Flags: needinfo?(mgaudet)

I think I've finally managed to get the instructions out of the minidump. It says the faulting instruction pointer is 0x6c1b455dc0 and here's what I've gotten with some work

0x6c1b455da0	ldr x10, [x27, #0x28]
0x6c1b455da4	strb w23, [x10, x24]
0x6c1b455da8	ldrsb w10, [x9, #1]
0x6c1b455dac	ldr x9, [x27, #0x28]
0x6c1b455db0	tbnz w10, #0x1f, #0x6c1b455de8
0x6c1b455db4	add x11, x9, x24
0x6c1b455db8	ldrb w12, [x11]
0x6c1b455dbc	add x10, x19, x12, lsl #3
0x6c1b455dc0	ldrsb w13, [x10, #1]   <-------- FAULTING INSTRUCTION HERE.
0x6c1b455dc4	tbnz w13, #0x1f, #0x6c1b4591ac
0x6c1b455dc8	and w11, w13, #0xff
0x6c1b455dcc	ldp w13, w12, [x27, #0xf0]
0x6c1b455dd0	ldrsb w10, [x10, #2]
0x6c1b455dd4	sub w11, w12, w11

So the unaligned access of x10+1 is coming directly from the assembly. The origin of that value is x19+ x12 << 3 (previous instruction). Sure enough, doing this calculation using the values from one of the dumps seems to confirm that what I'm seeing is real:

js> (0x000000700d6711d8 + (0x00000000ffffffba << 3)).toString(16)
'700d670fa8'

Best guess is that these instructions are implementing

   JSOp op = JSOp(*pc);
  int nuses = CodeSpec(op).nuses;

from js::StackUses

Reading the back trace, it seems like we're emitting either a GetAliasedVar or GetAliasedDebugVar; we then call into emitN where we set the initial value of the bytecode pointer.

Next we call updateDepth at that offset -- this reads back the immediately prior written bytecode. I don't know why this doesn't work on samsung phones, but I'll bet that this is the root cause -- see, under this theory, x12 should be our JSOp -- but the actual value in the register we have is 0x00000000ffffffba which is way out of bounds for our bytecode count.

Why this is happening now, in 113 where it didn't happen previously is totally unclear. For some reason clang is now emitting a sequence that is tripping this hardware issue?

I've no idea how to resolve this. Patching this flow to avoid the read-after-write is awkward.

I tried to reverse engineer the assembly, but I do not see what it might corresponds to within StackUses:

  unknown64_t* x19 = /* ??? */;
  size_t x12 = /* ??? */;
  unknown64_t* x10 = &x19[x12];
  int8_t w13 = x10.one_byte_within;
  if (w13 % 32) {  // Potentially a switch statement condition with cases which are all below 32.
    return; // jump far further.
  }
  uint8_t w11 = w13 & 0xff;
  uint64_t (w12, w13) = x27.some_field_in_large_struct; // offset of field 0xf0
  int8_t w10 = x10.two_bytes_within;
  uint16_t w11 = w12 - w11;  // x27.some_field_in_large_struct.first - x10.one_byte_within & 0xff;

At first I thought x10 could be the CodeSpec, but I would not expect any condition to be added after loading nuses.
Then I wondered if the condition could be the switch statement, but some of the matched opcodes are larger than 32.
I would not expect any of of the complex code path from GET_UINT16.

So I do not think this code belongs to StackUses as reported in the stack trace.

Right, this is potentially the nuses field which is being repeated before being tested elsewhere.

Just in case it's helpful, i've uploaded the -whole- disassembly I generated; the version in the comment was a snippet.

I am not sure what the assembly maps to. x19 is table of uint64_t struct indexed by a uint8_t. This would sounds like this is a table indexed by opcodes, but there are 2 occurrences where the field at offset 1 is compared against 32 (_ % 32 == 0, or _ & 31 == 0) at 0x6c1b455db0 and 0x6c1b455dc4.
I do not know from where these comparisons are coming from.

Also, given that 0x6c1b455da8 works, this is sounds quite unlikely that a failure would happen at 0x6c1b455dc0, especially since LRDB at 0x6c1b455db8 is suppose to read a single byte.

Could this be a sign issue? LDRB doing a sign extension in w12, which then causes problem when being used by reading the value from in incorrect memory page?

(In reply to Matthew Gaudet (he/him) [:mgaudet] from comment #7)

js> (0x000000700d6711d8 + (0x00000000ffffffba << 3)).toString(16)
'700d670fa8'

Be careful JS shift operators only work with 32 bits values, and not 64 bits. Thus this is not the correct result here.

I'm at my own wits end on this one; I'm going to leave this one in your hands

Poked at this a bit. It looks like everything's all been inlined together into one big pile of instructions and interleaved, but I think I've made a bit of progress.

Consider these instructions from the beginning of Matt's assembly sequence:

0x6c1b455d50	cmp w8, #6    
0x6c1b455d54	movn w9, #0x45   
0x6c1b455d58	cinc w23, w9, ne 

We are loading the constant ~0x45, then incrementing it if w8 is not 6. Interestingly, ~0x45 is 0xffffffba, which is precisely the mystery value that Matt found in x12. Later on, we have these instructions:

0x6c1b455d7c	adrp x19, #0x6c1d6c5000
0x6c1b455d80	add x19, x19, #0x1d8
...
0x6c1b455d94	and x9, x23, #0xff
...
0x6c1b455d9c	add x9, x19, x9, lsl #3
...
0x6c1b455da8	ldrsb w10, [x9, #1]

The first two instructions load a hardcoded pointer. The next two compute base_address + (our_mystery_register & 0xff) << 3. This looks like we're finding an entry from the CodeSpecTable. The last instruction loads (and sign-extends) a value from the CodeSpec: specifically, nuses is at offset 1.

0xffffffba & 0xff is 0xba, which is JSOp::GetAliasedVar. 0xbb is JSOp::GetAliasedDebugVar. So the first set of instructions I pulled out maps to this code, which we can confirm by noticing that NameLocation::Kind::EnvironmentCoordinate is 6, matching the hardcoded constant.

Staring at the rest of the code, I've convinced myself that x27 is the BytecodeEmitter, and it's being used to reference the vector of bytecode we're emitting. A BytecodeEmitter contains a BytecodeSection at offset 32, which contains a BytecodeVector at offset 0. Doing the math, this means that x27+0x28 is mBegin, x27+0x30 is the length, and x27+0x38 is the capacity. With that in mind, here's another interesting sequence of instructions. In this code, x23/w23 holds 0xffffffba.

0x6c1b455d4c	ldr x24, [x27, #0x30]     // x24 = offset (length of buffer before writing anything)
...	          
0x6c1b455da0	ldr x10, [x27, #0x28]     // x10 is address of buffer
0x6c1b455da4	strb w23, [x10, x24]      // write opcode in buffer at offset
...				          
0x6c1b455dac	ldr x9, [x27, #0x28]      // x9 is address of buffer
...				          
0x6c1b455db4	add x11, x9, x24          // x11 is offset of opcode we just wrote
0x6c1b455db8	ldrb w12, [x11]           // load the opcode
0x6c1b455dbc	add x10, x19, x12, lsl #3 // look it up in the codespec table
0x6c1b455dc0	ldrsb w13, [x10, #1]      // CRASH!

In short, we store the opcode in the buffer, then load it again to use as an index into the codespec table. The really interesting thing here is the ldrb. The ARM docs describe it as:

LDRB Wt, addr
Load Byte: loads a byte from memory addressed by addr, then zero-extends it to Wt.

Since we just stored 0xba, that's what x12 should contain. However, that's not what we're seeing. Looking at a dump, I see x12 = 0x00000000ffffffba. That's not zero-extended.

I have a strong suspicion that Samsung (or whoever designed this particular chip) messed up their store forwarding, and accidentally forwarded the entire register without clearing the upper bits.

Flags: needinfo?(nicolas.b.pierron)

Marking as S3 given that we've got some high-confidence this is a processor issue. The crash rate seems to have dropped a bit as well.

We'll need to investigate some mitigation if it keeps up

Severity: -- → S3

So this has started to spike even more.

At this point, it feels like we ought to do some refactoring to try to break sufficient dependencies that we no longer trigger this. I'll take a look at this shortly.

Severity: S3 → S2
Flags: needinfo?(mgaudet)
Summary: Crash in [@ js::StackUses] in js::frontend::BytecodeSection::updateDepth(js::frontend::BytecodeOffset) → Crash in [@ js::StackUses] in js::frontend::BytecodeSection::updateDepth(js::frontend::BytecodeOffset) (Samsung CPU Issue)

The goal of this patch is to avoid the bad store forwarding on
some Samsung phones

Assignee: nobody → mgaudet
Status: NEW → ASSIGNED
Flags: needinfo?(mgaudet)
Pushed by mgaudet@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c6a272593c07
Pass JSOp along rather than recomputing r=iain

Comment on attachment 9337950 [details]
Bug 1833315 - Pass JSOp along rather than recomputing r?iain

Beta/Release Uplift Approval Request

  • User impact if declined: Users of Samsung Galaxy S20s may continue to experience more crashes.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): A fairly trivial refactor intended to cause a hardware bug to no longer manifest
  • String changes made/needed:
  • Is Android affected?: Yes

Note: this patch is attempting to remove a bad access pattern that manifests exclusively on Samsung Galaxy S20 (and Note 20) hardware. It's hard to say if this will actually -fix- the issue, but it's hoped that this will remove the proximate cause.

Attachment #9337950 - Flags: approval-mozilla-release?
Attachment #9337950 - Flags: approval-mozilla-beta?
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

Comment on attachment 9337950 [details]
Bug 1833315 - Pass JSOp along rather than recomputing r?iain

Approved for 115.0b4.

Attachment #9337950 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9337950 [details]
Bug 1833315 - Pass JSOp along rather than recomputing r?iain

Approved for 104.2.0 next week, thanks.

Attachment #9337950 - Flags: approval-mozilla-release? → approval-mozilla-release+
See Also: → 1847414
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: