Crash in [@ js::HeapSlot::init] with graphics adapter Adreno (TM) 530
Categories
(Core :: JavaScript Engine, defect, P3)
Tracking
People
(Reporter: cpeterson, Unassigned)
References
Details
(Keywords: crash)
Crash Data
Crash report: https://crash-stats.mozilla.org/report/index/8087ee04-b034-4afa-b29a-012990230806
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 libxul.so js::HeapSlot::init js/src/gc/Barrier.h:947
0 libxul.so js::NativeObject::initReservedSlot js/src/vm/NativeObject.h:1288
0 libxul.so js::BoundFunctionObject::functionBindImpl js/src/vm/BoundFunctionObject.cpp:373
1 libxul.so js::BoundFunctionObject::functionBind js/src/vm/BoundFunctionObject.cpp:286
2 libxul.so CallJSNative js/src/vm/Interpreter.cpp:486
2 libxul.so js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:580
2 libxul.so InternalCall js/src/vm/Interpreter.cpp:647
2 libxul.so js::Call js/src/vm/Interpreter.cpp:679
3 libxul.so js::fun_call js/src/vm/JSFunction.cpp:956
4 ? @0x0000007cf2956cd0
Reporter
Comment 1•1 year ago
This isn't a new crash signature, but it appears there is a crash spike in Firefox Android 116.0.
Updated•1 year ago
Comment 2•1 year ago
The bug is marked as tracked for firefox116 (release). However, the bug still isn't assigned.
:sdetar, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit BugBot documentation.
Comment 3•1 year ago
This is a null deref crash.
It looks like all of the crashes have js::HeapSlot::init and BoundFunctionObject::functionBindImpl in their stack. I didn't see any code near the top of the stack that changed in 116.
Comment 4•1 year ago
Looking at the last month of Fenix crashes, for those with BoundFunctionObject::functionBindImpl in the proto stack, 93% of them are this crash, so it is not an obvious signature change.
Comment 5•1 year ago
ni? myself to try and get bytes out of the minidump.
This reminds me a bit of Bug 1833315, which was a CPU bug exposed through the compiler.
Comment 6•1 year ago
(In reply to Matthew Gaudet (he/him) [:mgaudet] from comment #5)
This reminds me a bit of Bug 1833315, which was a CPU bug exposed through the compiler.
Ah, good point. Looking at the Summary tab for this signature, 99.8% of these crashes on Android have the graphics adapter "Adreno (TM) 530". Looking across all Fenix crashes, only 0.51% of crashes have that adapter.
In fact, it looks like about 40% of all crashes with that adapter have this CPU value.
Comment 7•1 year ago
(In reply to Andrew McCreight [:mccr8] from comment #6)
In fact, it looks like about 40% of all crashes with that adapter have this CPU value.
Sorry, I meant "about 40% of all crashes with that adapter have this signature". More specifically, there are about 1000 crashes total with this graphics adapter in the last 7 days, and around 450 crashes in the last 7 days with this signature.
Comment 8•1 year ago
Ok; getminidump-instructions just worked for me.
Using disasm.pro, I get the following disassembly up to the crashing instruction:
bl #0x6c1aa64d08
cbz x0, #0x6c1b456538
ldp x9, x8, [sp, #0x88]
str x0, [sp, #0x20]
str x8, [x9]
ldp x9, x8, [sp, #0x58]
str x8, [x9]
ldr x1, [x20]
adrp x26, #0x6c1f04d000
add x26, x26, #0xdd0
adrp x24, #0x6c1f04d000
add x24, x24, #0xb68
ldr x10, [x1]
ldr x8, [x10]
ldr x9, [x8]
ldr x8, [sp, #0x20]
cmp x9, x26
b.eq #0x6c1b455d90
cmp x9, x24
b.ne #0x6c1b4565c8
ldrb w9, [x1, #0x18]
lsr w0, w9, #7
lsl w9, w21, #1
movz x27, #0x8000, lsl #32
movk x27, #0xfff8, lsl #48
bfxil x9, x0, #0, #1
orr x9, x9, x27
str x9, [x8, #0x20]
ldr x8, [x20]
ldr x1, [sp, #0x20]
orr x9, x8, #0xfffe000000000000
and x8, x8, #0x7ffffffc0000
str x9, [x1, #0x18] <--- Crash here.
(Posting now, just headed into a meeting; haven't had a chance to actually read the dump yet.)
Comment 9•1 year ago
Taking a quick peek at this, while trying not to nerd-snipe myself too badly:
bl #0x6c1aa64d08 # call something
cbz x0, #0x6c1b456538 # check result
ldp x9, x8, [sp, #0x88]
str x0, [sp, #0x20] # save the result in the stack frame
str x8, [x9]
ldp x9, x8, [sp, #0x58]
str x8, [x9]
ldr x1, [x20]
adrp x26, #0x6c1f04d000
add x26, x26, #0xdd0
adrp x24, #0x6c1f04d000
add x24, x24, #0xb68
ldr x10, [x1]
ldr x8, [x10]
ldr x9, [x8]
ldr x8, [sp, #0x20] # load the result in x8
cmp x9, x26
b.eq #0x6c1b455d90
cmp x9, x24
b.ne #0x6c1b4565c8
ldrb w9, [x1, #0x18]
lsr w0, w9, #7
lsl w9, w21, #1
movz x27, #0x8000, lsl #32
movk x27, #0xfff8, lsl #48
bfxil x9, x0, #0, #1
orr x9, x9, x27
str x9, [x8, #0x20] # store a value at offset 0x20 from result
ldr x8, [x20]
ldr x1, [sp, #0x20] # load the result again
orr x9, x8, #0xfffe000000000000
and x8, x8, #0x7ffffffc0000
str x9, [x1, #0x18] # store a value at offset 0x18 from result
We start by doing a call, and checking whether the result is null. If it isn't, we store the value on the stack. Later, we load it two separate times and store values at offsets 0x20 and 0x18. The second store is the one that Matt identified as crashing, which doesn't make any sense: the only way to legitimately get that result would be if the call returned a pointer to the very end of an unmapped page, so that result + 0x20 is a valid pointer but result + 0x18 isn't. Given the strong indication that there's something hardware-specific going on here, I doubt that we're returning such a pointer.
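As a sanity check on how narrow that "legitimate" window actually is, here is a throwaway count of the qualifying page offsets (my own illustration, assuming 4 KiB pages and page-granular protection; nothing below comes from the crash data):

#include <cstdint>
#include <cstdio>

int main() {
  // Count the page offsets the returned pointer would have to land on for the
  // 8-byte store at result+0x20 to sit entirely in the next (mapped) page
  // while the 8-byte store at result+0x18 still touches the unmapped page.
  constexpr uint64_t kPageSize = 4096;
  int candidates = 0;
  for (uint64_t offset = 0; offset < kPageSize; offset++) {
    bool storeAt0x20Mapped = offset + 0x20 >= kPageSize;  // [o+0x20, o+0x27] all in the next page
    bool storeAt0x18Faults = offset + 0x18 < kPageSize;   // first byte of [o+0x18, o+0x1f] unmapped
    if (storeAt0x20Mapped && storeAt0x18Faults) {
      candidates++;
    }
  }
  // Prints 8: only offsets 0xfe0..0xfe7 qualify, i.e. the last few bytes of an
  // unmapped page whose successor happens to be mapped.
  printf("candidate offsets per page: %d\n", candidates);
  return 0;
}

Eight addresses out of every 4096 page offsets (fewer still once allocation alignment is factored in) is a narrow enough window that a hardware explanation looks far more plausible.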
Based on the offsets, I think we're looking at this code:
(Showing my work:
NativeObject:
0x00: shape
0x08: slots
0x10: elements
0x18: first fixed slot = TargetSlot
0x20: second fixed slot = FlagsSlot
See here for fixed slot definitions.)
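To make that concrete, here is the sequence paraphrased at the source level. This is an illustration only: the struct is a stand-in with the layout worked out above, not the real js::NativeObject, and the stored constants only loosely mirror the values built in the dump.

#include <cstdint>
#include <cstdio>

// Stand-in with the layout above; not the real SpiderMonkey type.
struct FakeBoundFunctionObject {
  uint64_t shape;       // 0x00
  uint64_t slots;       // 0x08
  uint64_t elements;    // 0x10
  uint64_t targetSlot;  // 0x18: first fixed slot (TargetSlot)
  uint64_t flagsSlot;   // 0x20: second fixed slot (FlagsSlot)
};

FakeBoundFunctionObject* allocate() {
  static FakeBoundFunctionObject storage{};
  return &storage;  // stands in for the call at the top of the dump
}

int main() {
  FakeBoundFunctionObject* result = allocate();  // bl ...; cbz x0, ...
  if (!result) return 1;
  // The compiler spills `result` to [sp, #0x20] and reloads it before each of
  // the stores below; the crash is on the second reload + store.
  result->flagsSlot = 0xfff8800000000000ull;   // str x9, [x8, #0x20]
  result->targetSlot = 0xfffe000000000000ull;  // str x9, [x1, #0x18]  <-- the faulting store
  printf("%llx\n", (unsigned long long)result->targetSlot);
  return 0;
}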
It's particularly weird that the crash reports say this is a null pointer deref: first of all because we're checking for null, and secondly because we've just stored a value at an 8-byte offset from this one, so we should have already crashed.
Comment 10•1 year ago
Gotta stop leaving the nerd-catnip out for Iain
Anyhow, his analysis is deeper than mine, though I reran the bytes through capstone, which made the disasm a little bit easier to read (capstone does a better job here).
Iain's analysis concurs with the dump processor about where we're crashing, and I concur with the analysis -- this is almost certainly a hardware issue. Having said that, unlike the Samsung crash, it's unclear how we could write a patch to try to reduce the incidence of this... some sort of #pragma no-opt to wrap this code, Android only?
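For the record, a minimal sketch of that "pragma no-opt, Android only" idea. This is purely illustrative, not a proposed patch: the function is a hypothetical stand-in for the slot-initialization code we would wrap, and only the use of clang's optimize pragma is the point.

#include <cstdint>

// Only drop optimization on the configuration that crashes.
#if defined(__clang__) && defined(__aarch64__) && defined(__ANDROID__)
#  pragma clang optimize off
#endif

// Everything between "optimize off" and "optimize on" is compiled without
// optimization, so the spill/reload scheduling that seems to trip the CPU
// would change.
void initFixedSlotsUnoptimized(uint64_t* fixedSlots, uint64_t flagsBits,
                               uint64_t targetBits) {
  fixedSlots[1] = flagsBits;   // the +0x20 store in the dump
  fixedSlots[0] = targetBits;  // the +0x18 store that faults
}

#if defined(__clang__) && defined(__aarch64__) && defined(__ANDROID__)
#  pragma clang optimize on
#endif

int main() {
  uint64_t fixedSlots[2] = {};
  initFixedSlotsUnoptimized(fixedSlots, 0xfff8800000000000ull, 0xfffe000000000000ull);
  return 0;
}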
(Dump from capstone for posterity below)
0x0: bl #0xffffffffff60efc8
0x4: cbz x0, #0x7f8
0x8: ldp x9, x8, [sp, #0x88]
0xc: str x0, [sp, #0x20] x0 stored into stack slot; x0 should be non-zero due to preceding branch.
0x10: str x8, [x9]
0x14: ldp x9, x8, [sp, #0x58]
0x18: str x8, [x9]
0x1c: ldr x1, [x20]
0x20: adrp x26, #0x3bf8000
0x24: add x26, x26, #0xdd0
0x28: adrp x24, #0x3bf8000
0x2c: add x24, x24, #0xb68
0x30: ldr x10, [x1]
0x34: ldr x8, [x10]
0x38: ldr x9, [x8]
0x3c: ldr x8, [sp, #0x20] x8 loaded from same stack slot
0x40: cmp x9, x26
0x44: b.eq #0x50
0x48: cmp x9, x24
0x4c: b.ne #0x888
0x50: ldrb w9, [x1, #0x18]
0x54: lsr w0, w9, #7
0x58: lsl w9, w21, #1
0x5c: movz x27, #0x8000, lsl #32
0x60: movk x27, #0xfff8, lsl #48
0x64: bfxil x9, x0, #0, #1
0x68: orr x9, x9, x27
0x6c: str x9, [x8, #0x20] storing into x8, which should be the same value as x1?
0x70: ldr x8, [x20]
0x74: ldr x1, [sp, #0x20] x1 loaded from stack slot.
0x78: orr x9, x8, #0xfffe000000000000
0x7c: and x8, x8, #0x7ffffffc0000
0x80: str x9, [x1, #0x18] Crash, x1 is zero
Comment 11•1 year ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 AArch64 and ARM crashes on release
For more information, please visit BugBot documentation.
Comment 12•1 year ago
Musing about this a little bit more: the crash addresses all seem to be 0x18 or 0x28. Matt helpfully provided me with the disassembly for the instructions immediately following the crash:
0x84: ldr x0, [x8]
0x88: cbnz x0, #0x428
0x8c: cbz w23, #0x774
0x90: ldr x8, [x22]
0x94: movz x9, #0xfffb, lsl #48
0x98: ldr x1, [sp, #0x20] # Load the stack slot again
0x9c: cmp x8, x9
0xa0: str x8, [x1, #0x28] # Store at offset 0x28
We never seem to crash at 0x20, even though it's theoretically using the same base pointer. So it appears that sometimes, given preconditions that I assume are set up by the rest of the code in this function, ldr x1, [sp, #0x20] loads a null pointer, even though the value stored at that address is non-null. Two separate instances of that instruction seem to go wrong, so it seems likely that the problem is there, not elsewhere. On the other hand, "loading a value from the stack" is one of the most basic instructions around, so it really shouldn't go wrong like this.
Comment 13•1 year ago
In the short term, there's nothing actionable here. I'm going to set a reminder to myself for one week from now to check the crash rate for 116.2.0.
Comment 14•1 year ago
(In reply to Iain Ireland [:iain] from comment #12)
On the other hand, "loading a value from the stack" is one of the most basic instructions around, so it really shouldn't go wrong like this.
The load happens very shortly after the store, which means it's likely that the data never actually hit memory and is instead forwarded within the core, possibly from the store queue or even the forwarding network. That's a very complex and very bug-prone path in CPUs. We've already seen errata in Intel and AMD CPUs where store-to-load forwarding was messing stuff up, returning stale data instead of the actual value.
These crashes are all on Snapdragon 820 SoCs which use Kryo cores, which have a few non-public but known errata. Chances are we're hitting one of them or possibly an unknown (or just undocumented) bug.
Comment 15•1 year ago
I was thinking about store forwarding, but the important value here is being stored at offset 0x0c in the disassembly and not loaded until offset 0x74 (or even 0x98), which seems like a big window.
Updated•1 year ago
Comment 16•1 year ago
This is a reminder regarding comment #2!
The bug is marked as tracked for firefox116 (release). We have limited time to fix this; the soft freeze is in 10 days. However, the bug still isn't assigned.
Comment 17•1 year ago
Given that we've seen zero (0) crashes from 116.2.0m I don't think this is going to need work.
Updated•1 year ago
Comment 18•1 year ago
So is the hypothesis that something (PGO, the phase of the moon, etc.) changed the code that we generate very slightly between the previous build and this one, preventing us from running into the CPU bug?
Is there a convenient way for us to dump the equivalent code in 116.2.0, to see whether there are any differences?
Comment 19•1 year ago
(The m was a typo by the way -- sorry)
Comment 20•1 year ago
Sorry, meant to also say: Yes -- something seems to be preventing us from triggering this, deeply unclear what. It must be possible to get disassembly, but I have to imagine you'd have to grab both APKs and extract libxul from inside.
Comment 21•1 year ago
(In reply to Iain Ireland [:iain] from comment #18)
So is the hypothesis that something (PGO, the phase of the moon, etc) changed the code that we generate very slightly between the previous build and this one, preventing us from running into the CPU bug?
Yes. These bugs are often highly timing-dependent, so slight changes in the code make them go away. We respun two builds of 106 to get rid of similar bugs (see this and this).
Is there a convenient way for us to dump the equivalent code in 116.2.0, to see whether there are any differences?
Possible, yes; convenient, no. Since several functions have been inlined at the point of crash, this will all look like a blob within another function.
Comment 22•1 year ago
Thinking out loud: it seems like there are two possibilities. Either we made the same inlining decisions in the good build as the bad build, or we didn't.
In the bad build, we've called functionBindImpl and inlined initReservedSlot / HeapSlot::init. If we can dump the instructions for functionBindImpl, then figuring out what's been inlined is just a matter of effort. If we made different inlining decisions due to PGO, then it's not at all surprising that we don't trigger the bug. If that's the case, maybe we can replace the calls to initReservedSlot with a wrapper that doesn't get inlined on ARM64.
If we have the same inlining decisions between the two, then dumping the instructions for functionBindImpl will let us compare the good case to the bad case directly.
My understanding is that the symbols are stored separately from the APK, so maybe this is non-trivial.
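A rough sketch of what such a non-inlined wrapper could look like. Names and types here are stand-ins for illustration; in the tree this would presumably be a MOZ_NEVER_INLINE (or attribute-gated) wrapper around NativeObject::initReservedSlot rather than the toy below.

#include <cstdint>

// Stand-ins so the sketch compiles on its own; not the real SpiderMonkey types.
struct FakeValue { uint64_t bits; };
struct FakeNativeObject { uint64_t header[3]; FakeValue fixedSlots[4]; };

// Force the call out of line only where the crash shows up, so everything
// else keeps the fully inlined fast path.
#if defined(__aarch64__) && defined(__ANDROID__)
#  define BOUNDFN_NEVER_INLINE __attribute__((noinline))
#else
#  define BOUNDFN_NEVER_INLINE
#endif

BOUNDFN_NEVER_INLINE void initReservedSlotOutOfLine(FakeNativeObject* obj,
                                                    unsigned slot, FakeValue v) {
  // The real wrapper would just forward to initReservedSlot; keeping it out of
  // line changes the generated code in functionBindImpl enough that the
  // suspect spill/reload sequence no longer exists there.
  obj->fixedSlots[slot] = v;
}

int main() {
  FakeNativeObject obj{};
  initReservedSlotOutOfLine(&obj, 0, FakeValue{0xfffe000000000000ull});
  return 0;
}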
Comment 23•1 year ago
This is a reminder regarding comment #2!
The bug is marked as tracked for firefox116 (release). We have limited time to fix this; the soft freeze is in 3 days. However, the bug still isn't assigned and has low priority.
Updated•1 year ago
Comment 24•1 year ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.