Crash in [@ js::StackUses] in js::frontend::BytecodeSection::updateDepth(js::frontend::BytecodeOffset) (Samsung CPU Issue)
Categories
(Core :: JavaScript Engine, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox113 | --- | wontfix |
firefox114 | --- | fixed |
firefox115 | --- | fixed |
firefox116 | --- | fixed |
People
(Reporter: cpeterson, Assigned: mgaudet)
References
(Blocks 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files)
2.09 KB,
application/octet-stream
|
Details | |
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-beta+
pascalc
:
approval-mozilla-release+
|
Details | Review |
This crash looks like a regression in Fenix 106 or a new crash signature from function inlining?
Crash report: https://crash-stats.mozilla.org/report/index/f585f1a9-0ed8-4dce-bc64-be8aa0230516
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 libxul.so js::StackUses js/src/vm/BytecodeUtil.h:333
0 libxul.so js::frontend::BytecodeSection::updateDepth js/src/frontend/BytecodeSection.cpp:176
0 libxul.so js::frontend::BytecodeEmitter::emitN js/src/frontend/BytecodeEmitter.cpp:369
0 libxul.so js::frontend::BytecodeEmitter::emitEnvCoordOp js/src/frontend/BytecodeEmitter.cpp:792
0 libxul.so js::frontend::NameOpEmitter::emitGet js/src/frontend/NameOpEmitter.cpp:99
0 libxul.so js::frontend::BytecodeEmitter::emitGetNameAtLocation js/src/frontend/BytecodeEmitter.cpp:1537
0 libxul.so js::frontend::BytecodeEmitter::emitGetName js/src/frontend/BytecodeEmitter.h:690
0 libxul.so js::frontend::BytecodeEmitter::emitGetName js/src/frontend/BytecodeEmitter.cpp:1547
0 libxul.so js::frontend::BytecodeEmitter::emitTree js/src/frontend/BytecodeEmitter.cpp:11692
1 libxul.so js::frontend::BytecodeEmitter::emitPropLHS js/src/frontend/BytecodeEmitter.cpp:1635
Comment 1•1 year ago
|
||
Matthew, I think you are already looking at this issue ;)
Assignee | ||
Comment 2•1 year ago
|
||
This signature is covering three different crashes at least.
Looking at crash stats I'll note that of ~200 crashes, 95% are coming from Samsung mobiles. Almost all the samsung crashes have crash addresses ended in fa9
-- which feels important but I'm not sure what it means; they all appear to come to the same build IDs. Looking at the registers, there's a register with the crashing address (but aligned, ending in fa8
). Unfortunately I can't see what instruction is being used to do the load, so I'm not sure if this is like 1-past the end of a buffer or if we're doing an unaligned load where we ought to not be.
There's also a series of crashes at 0x0000087c
, all with the same install time (2023-05-09 17:49:58) -- these can be disregarded as bad hardware I suspect.
At least one of these crashes from windows is pretty suspicious. As much as the stack looks very real, it's reporting a MOZ_RELEASE_ASSERT(pn_type < ParseNodeKind::Limit)
, which is not used particularly nearby; it's unclear why this is the case.
Assignee | ||
Updated•1 year ago
|
Comment 3•1 year ago
|
||
This doesn't feel like an S4 to me given the crash volume. This is currently the #4 overall content process crash on Fenix release.
Assignee | ||
Comment 4•1 year ago
|
||
I'd be interested in Iain's thoughts on these crash patterns.
Assignee | ||
Comment 5•1 year ago
|
||
(Actually -- he's on PTO, nevermind. I'll take another look today)
Assignee | ||
Comment 6•1 year ago
|
||
Looking into the fa9 crashes, which are still 95% of the crashes under this signature.
- 100% are coming from Samsung mobiles.
- Every single one is a variant of the Samsung Galaxy 20 (or Note 20)
- 85% coming from the same proto-signature
- The next highest proto-signature is 10% and identical except for one instance of js::frontend::BytecodeEmitter::emitPropLHS.
This has however persisted across 3 builds.
Assignee | ||
Comment 7•1 year ago
|
||
I think I've finally managed to get the instructions out of the minidump. It says the faulting instruction pointer is 0x6c1b455dc0 and here's what I've gotten with some work
0x6c1b455da0 ldr x10, [x27, #0x28]
0x6c1b455da4 strb w23, [x10, x24]
0x6c1b455da8 ldrsb w10, [x9, #1]
0x6c1b455dac ldr x9, [x27, #0x28]
0x6c1b455db0 tbnz w10, #0x1f, #0x6c1b455de8
0x6c1b455db4 add x11, x9, x24
0x6c1b455db8 ldrb w12, [x11]
0x6c1b455dbc add x10, x19, x12, lsl #3
0x6c1b455dc0 ldrsb w13, [x10, #1] <-------- FAULTING INSTRUCTION HERE.
0x6c1b455dc4 tbnz w13, #0x1f, #0x6c1b4591ac
0x6c1b455dc8 and w11, w13, #0xff
0x6c1b455dcc ldp w13, w12, [x27, #0xf0]
0x6c1b455dd0 ldrsb w10, [x10, #2]
0x6c1b455dd4 sub w11, w12, w11
So the unaligned access of x10+1
is coming directly from the assembly. The origin of that value is x19+ x12 << 3
(previous instruction). Sure enough, doing this calculation using the values from one of the dumps seems to confirm that what I'm seeing is real:
js> (0x000000700d6711d8 + (0x00000000ffffffba << 3)).toString(16)
'700d670fa8'
Assignee | ||
Comment 8•1 year ago
|
||
Best guess is that these instructions are implementing
JSOp op = JSOp(*pc);
int nuses = CodeSpec(op).nuses;
from js::StackUses
Reading the back trace, it seems like we're emitting either a GetAliasedVar or GetAliasedDebugVar; we then call into emitN
where we set the initial value of the bytecode pointer.
Next we call updateDepth at that offset -- this reads back the immediately prior written bytecode. I don't know why this doesn't work on samsung phones, but I'll bet that this is the root cause -- see, under this theory, x12
should be our JSOp
-- but the actual value in the register we have is 0x00000000ffffffba
which is way out of bounds for our bytecode count.
Why this is happening now, in 113 where it didn't happen previously is totally unclear. For some reason clang is now emitting a sequence that is tripping this hardware issue?
I've no idea how to resolve this. Patching this flow to avoid the read-after-write is awkward.
Comment 9•1 year ago
•
|
||
I tried to reverse engineer the assembly, but I do not see what it might corresponds to within StackUses:
unknown64_t* x19 = /* ??? */;
size_t x12 = /* ??? */;
unknown64_t* x10 = &x19[x12];
int8_t w13 = x10.one_byte_within;
if (w13 % 32) { // Potentially a switch statement condition with cases which are all below 32.
return; // jump far further.
}
uint8_t w11 = w13 & 0xff;
uint64_t (w12, w13) = x27.some_field_in_large_struct; // offset of field 0xf0
int8_t w10 = x10.two_bytes_within;
uint16_t w11 = w12 - w11; // x27.some_field_in_large_struct.first - x10.one_byte_within & 0xff;
At first I thought x10 could be the CodeSpec, but I would not expect any condition to be added after loading nuses
.
Then I wondered if the condition could be the switch statement, but some of the matched opcodes are larger than 32.
I would not expect any of of the complex code path from GET_UINT16.
So I do not think this code belongs to StackUses as reported in the stack trace.
Right, this is potentially the nuses
field which is being repeated before being tested elsewhere.
Assignee | ||
Comment 10•1 year ago
|
||
Just in case it's helpful, i've uploaded the -whole- disassembly I generated; the version in the comment was a snippet.
Comment 11•1 year ago
|
||
I am not sure what the assembly maps to. x19 is table of uint64_t struct indexed by a uint8_t. This would sounds like this is a table indexed by opcodes, but there are 2 occurrences where the field at offset 1 is compared against 32 (_ % 32 == 0, or _ & 31 == 0) at 0x6c1b455db0 and 0x6c1b455dc4.
I do not know from where these comparisons are coming from.
Also, given that 0x6c1b455da8 works, this is sounds quite unlikely that a failure would happen at 0x6c1b455dc0, especially since LRDB at 0x6c1b455db8 is suppose to read a single byte.
Could this be a sign issue? LDRB doing a sign extension in w12, which then causes problem when being used by reading the value from in incorrect memory page?
(In reply to Matthew Gaudet (he/him) [:mgaudet] from comment #7)
js> (0x000000700d6711d8 + (0x00000000ffffffba << 3)).toString(16) '700d670fa8'
Be careful JS shift operators only work with 32 bits values, and not 64 bits. Thus this is not the correct result here.
Assignee | ||
Comment 12•1 year ago
|
||
I'm at my own wits end on this one; I'm going to leave this one in your hands
Comment 13•1 year ago
|
||
Poked at this a bit. It looks like everything's all been inlined together into one big pile of instructions and interleaved, but I think I've made a bit of progress.
Consider these instructions from the beginning of Matt's assembly sequence:
0x6c1b455d50 cmp w8, #6
0x6c1b455d54 movn w9, #0x45
0x6c1b455d58 cinc w23, w9, ne
We are loading the constant ~0x45, then incrementing it if w8 is not 6. Interestingly, ~0x45 is 0xffffffba, which is precisely the mystery value that Matt found in x12. Later on, we have these instructions:
0x6c1b455d7c adrp x19, #0x6c1d6c5000
0x6c1b455d80 add x19, x19, #0x1d8
...
0x6c1b455d94 and x9, x23, #0xff
...
0x6c1b455d9c add x9, x19, x9, lsl #3
...
0x6c1b455da8 ldrsb w10, [x9, #1]
The first two instructions load a hardcoded pointer. The next two compute base_address + (our_mystery_register & 0xff) << 3
. This looks like we're finding an entry from the CodeSpecTable. The last instruction loads (and sign-extends) a value from the CodeSpec: specifically, nuses is at offset 1.
0xffffffba & 0xff
is 0xba, which is JSOp::GetAliasedVar. 0xbb is JSOp::GetAliasedDebugVar. So the first set of instructions I pulled out maps to this code, which we can confirm by noticing that NameLocation::Kind::EnvironmentCoordinate is 6, matching the hardcoded constant.
Staring at the rest of the code, I've convinced myself that x27 is the BytecodeEmitter, and it's being used to reference the vector of bytecode we're emitting. A BytecodeEmitter contains a BytecodeSection at offset 32, which contains a BytecodeVector at offset 0. Doing the math, this means that x27+0x28 is mBegin, x27+0x30 is the length, and x27+0x38 is the capacity. With that in mind, here's another interesting sequence of instructions. In this code, x23/w23 holds 0xffffffba
.
0x6c1b455d4c ldr x24, [x27, #0x30] // x24 = offset (length of buffer before writing anything)
...
0x6c1b455da0 ldr x10, [x27, #0x28] // x10 is address of buffer
0x6c1b455da4 strb w23, [x10, x24] // write opcode in buffer at offset
...
0x6c1b455dac ldr x9, [x27, #0x28] // x9 is address of buffer
...
0x6c1b455db4 add x11, x9, x24 // x11 is offset of opcode we just wrote
0x6c1b455db8 ldrb w12, [x11] // load the opcode
0x6c1b455dbc add x10, x19, x12, lsl #3 // look it up in the codespec table
0x6c1b455dc0 ldrsb w13, [x10, #1] // CRASH!
In short, we store the opcode in the buffer, then load it again to use as an index into the codespec table. The really interesting thing here is the ldrb. The ARM docs describe it as:
LDRB Wt, addr
Load Byte: loads a byte from memory addressed by addr, then zero-extends it to Wt.
Since we just stored 0xba
, that's what x12 should contain. However, that's not what we're seeing. Looking at a dump, I see x12 = 0x00000000ffffffba
. That's not zero-extended.
I have a strong suspicion that Samsung (or whoever designed this particular chip) messed up their store forwarding, and accidentally forwarded the entire register without clearing the upper bits.
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 14•1 year ago
|
||
Marking as S3 given that we've got some high-confidence this is a processor issue. The crash rate seems to have dropped a bit as well.
We'll need to investigate some mitigation if it keeps up
Assignee | ||
Comment 15•1 year ago
|
||
So this has started to spike even more.
At this point, it feels like we ought to do some refactoring to try to break sufficient dependencies that we no longer trigger this. I'll take a look at this shortly.
Assignee | ||
Comment 16•1 year ago
|
||
The goal of this patch is to avoid the bad store forwarding on
some Samsung phones
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Comment 17•1 year ago
|
||
Assignee | ||
Comment 18•1 year ago
•
|
||
Comment on attachment 9337950 [details]
Bug 1833315 - Pass JSOp along rather than recomputing r?iain
Beta/Release Uplift Approval Request
- User impact if declined: Users of Samsung Galaxy S20s may continue to experience more crashes.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): A fairly trivial refactor intended to cause a hardware bug to no longer manifest
- String changes made/needed:
- Is Android affected?: Yes
Note: this patch is attempting to remove a bad access pattern that manifests exclusively on Samsung Galaxy S20 (and Note 20) hardware. It's hard to say if this will actually -fix- the issue, but it's hoped that this will remove the proximate cause.
Comment 19•1 year ago
|
||
bugherder |
Comment 20•1 year ago
|
||
Comment on attachment 9337950 [details]
Bug 1833315 - Pass JSOp along rather than recomputing r?iain
Approved for 115.0b4.
Comment 21•1 year ago
|
||
bugherder uplift |
Comment 22•1 year ago
|
||
Comment on attachment 9337950 [details]
Bug 1833315 - Pass JSOp along rather than recomputing r?iain
Approved for 104.2.0 next week, thanks.
Comment 23•1 year ago
|
||
bugherder uplift |
Description
•