Open Bug 1847414 Opened 1 year ago Updated 1 year ago

Crash in [@ js::HeapSlot::init] with graphics adapter Adreno (TM) 530

Categories

(Core :: JavaScript Engine, defect, P3)

Unspecified
Android
defect

Tracking


Tracking Status
firefox-esr102 --- affected
firefox-esr115 --- affected
firefox116 + wontfix
firefox117 --- affected
firefox118 --- affected

People

(Reporter: cpeterson, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/8087ee04-b034-4afa-b29a-012990230806

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  libxul.so  js::HeapSlot::init  js/src/gc/Barrier.h:947
0  libxul.so  js::NativeObject::initReservedSlot  js/src/vm/NativeObject.h:1288
0  libxul.so  js::BoundFunctionObject::functionBindImpl  js/src/vm/BoundFunctionObject.cpp:373
1  libxul.so  js::BoundFunctionObject::functionBind  js/src/vm/BoundFunctionObject.cpp:286
2  libxul.so  CallJSNative  js/src/vm/Interpreter.cpp:486
2  libxul.so  js::InternalCallOrConstruct  js/src/vm/Interpreter.cpp:580
2  libxul.so  InternalCall  js/src/vm/Interpreter.cpp:647
2  libxul.so  js::Call  js/src/vm/Interpreter.cpp:679
3  libxul.so  js::fun_call  js/src/vm/JSFunction.cpp:956
4  ?  @0x0000007cf2956cd0  

This isn't a new crash signature, but it appears there is a crash spike in Firefox Android 116.0.

OS: All → Android

The bug is marked as tracked for firefox116 (release). However, the bug still isn't assigned.

:sdetar, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(sdetar)

This is a null deref crash.

It looks like all of the crashes have js::HeapSlot::init and BoundFunctionObject::functionBindImpl in their stack. I didn't see any code near the top of the stack that changed in 116.

Looking at the last month of Fenix crashes, for those with BoundFunctionObject::functionBindImpl in the proto stack, 93% of them are this crash, so it is not an obvious signature change.

ni? myself to try and get bytes out of the minidump.

This reminds me a bit of Bug 1833315, which was a CPU bug exposed through the compiler.

Flags: needinfo?(mgaudet)

(In reply to Matthew Gaudet (he/him) [:mgaudet] from comment #5)

This reminds me a bit of Bug 1833315, which was a CPU bug exposed through the compiler.

Ah, good point. Looking at the Summary tab for this signature, 99.8% of these crashes on Android have the graphics adapter "Adreno (TM) 530". Looking across all Fenix crashes, only 0.51% of crashes have that adapter.

In fact, it looks like about 40% of all crashes with that adapter have this CPU value.

Summary: Crash in [@ js::HeapSlot::init] → Crash in [@ js::HeapSlot::init] with graphics adapter Adreno (TM) 530
See Also: → 1833315

(In reply to Andrew McCreight [:mccr8] from comment #6)

In fact, it looks like about 40% of all crashes with that adapter have this CPU value.

Sorry, I meant "about 40% of all crashes with that adapter have this signature". More specifically, there are about 1000 crashes total with this graphics adapter in the last 7 days, and around 450 crashes in the last 7 days with this signature.

Ok; getminidump-instructions just worked for me

Using disasm.pro I get the following disassembly up to the crashing instruction:

bl #0x6c1aa64d08
cbz x0, #0x6c1b456538
ldp x9, x8, [sp, #0x88]
str x0, [sp, #0x20]
str x8, [x9]
ldp x9, x8, [sp, #0x58]
str x8, [x9]
ldr x1, [x20]
adrp x26, #0x6c1f04d000
add x26, x26, #0xdd0
adrp x24, #0x6c1f04d000
add x24, x24, #0xb68
ldr x10, [x1]
ldr x8, [x10]
ldr x9, [x8]
ldr x8, [sp, #0x20]
cmp x9, x26
b.eq #0x6c1b455d90
cmp x9, x24
b.ne #0x6c1b4565c8
ldrb w9, [x1, #0x18]
lsr w0, w9, #7
lsl w9, w21, #1
movz x27, #0x8000, lsl #32
movk x27, #0xfff8, lsl #48
bfxil x9, x0, #0, #1
orr x9, x9, x27
str x9, [x8, #0x20]
ldr x8, [x20]
ldr x1, [sp, #0x20]
orr x9, x8, #0xfffe000000000000
and x8, x8, #0x7ffffffc0000
str x9, [x1, #0x18] <--- Crash here. 

(Posting now, just headed into a meeting, haven't got a chance to actually read dump yet)

Taking a quick peek at this, while trying not to nerd-snipe myself too badly:

bl #0x6c1aa64d08               # call something
cbz x0, #0x6c1b456538          # check result
ldp x9, x8, [sp, #0x88]        
str x0, [sp, #0x20]            # save the result in the stack frame
str x8, [x9]                   
ldp x9, x8, [sp, #0x58]        
str x8, [x9]                   
ldr x1, [x20]                  
adrp x26, #0x6c1f04d000        
add x26, x26, #0xdd0
adrp x24, #0x6c1f04d000        
add x24, x24, #0xb68
ldr x10, [x1]                  
ldr x8, [x10]
ldr x9, [x8]
ldr x8, [sp, #0x20]            # load the result in x8
cmp x9, x26                    
b.eq #0x6c1b455d90
cmp x9, x24
b.ne #0x6c1b4565c8
ldrb w9, [x1, #0x18]           
lsr w0, w9, #7                 
lsl w9, w21, #1                
movz x27, #0x8000, lsl #32     
movk x27, #0xfff8, lsl #48
bfxil x9, x0, #0, #1           
orr x9, x9, x27                
str x9, [x8, #0x20]            # store a value at offset 0x20 from result
ldr x8, [x20]                  
ldr x1, [sp, #0x20]            # load the result again         
orr x9, x8, #0xfffe000000000000
and x8, x8, #0x7ffffffc0000    
str x9, [x1, #0x18]            # store a value at offset 0x18 from result

We start by doing a call, and checking whether the result is null. If it isn't, we store the value on the stack. Later, we load it two separate times and store values at offsets 0x20 and 0x18. The second store is the one that Matt identified as crashing, which doesn't make any sense: the only way to legitimately get that result would be if the call returned a pointer to the very end of an unmapped page, so that result + 0x20 is a valid pointer but result + 0x18 isn't. Given the strong indication that there's something hardware-specific going on here, I doubt that we're returning such a pointer.

Based on the offsets, I think we're looking at the reserved-slot initialization in BoundFunctionObject::functionBindImpl.

(Showing my work:

NativeObject:
0x00: shape
0x08: slots
0x10: elements
0x18: first fixed slot = TargetSlot
0x20: second fixed slot = FlagsSlot

See here for fixed slot definitions.)
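To make the offset math concrete, here's a free-standing C++ model of that layout and of the store pattern in the disassembly. This is not SpiderMonkey code -- FakeBoundFunction, allocateObject, and bindShape are made-up names -- and it only assumes 64-bit pointers:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Free-standing model of the layout deduced above (not SpiderMonkey code;
// all names here are made up). Assumes 64-bit pointers.
struct FakeBoundFunction {
  void* shape;        // 0x00
  void* slots;        // 0x08
  void* elements;     // 0x10
  uint64_t fixed[2];  // fixed[0] @ 0x18 = TargetSlot, fixed[1] @ 0x20 = FlagsSlot
};
static_assert(offsetof(FakeBoundFunction, fixed) == 0x18,
              "first fixed slot matches the crashing store's offset");

static FakeBoundFunction gStorage;
// Stands in for whatever the call at offset 0x0 in the disassembly allocates.
static FakeBoundFunction* allocateObject() { return &gStorage; }

static bool bindShape(uint64_t flagsBits, uint64_t targetBits) {
  FakeBoundFunction* obj = allocateObject();  // bl #...
  if (!obj) {                                 // cbz x0, #... : bail out on null
    return false;
  }
  // In the real build the pointer is spilled to [sp, #0x20] here and
  // reloaded before each of the stores below.
  obj->fixed[1] = flagsBits;   // str x9, [x8, #0x20]  (FlagsSlot)
  obj->fixed[0] = targetBits;  // str x9, [x1, #0x18]  (TargetSlot) <-- the crashing store
  return true;
}

int main() {
  printf("%s\n", bindShape(1, 2) ? "ok" : "alloc failed");
  return 0;
}

The static_assert is only there to confirm that with 8-byte pointers the first fixed slot lands at 0x18, i.e. the offset of the crashing store.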

It's particularly weird that the crash reports say this is a null pointer deref: first of all because we're checking for null, and secondly because we've just stored a value at an 8-byte offset from this one, so we should have already crashed.

Gotta stop leaving the nerd-catnip out for Iain

Anyhow, his analysis is deeper than mine, though I reran the bytes through capstone, which made the disassembly a little easier to read (capstone does a better job here).

Iain's analysis agrees with the dump processor about where we're crashing, and I concur with his conclusion -- this is almost certainly a hardware issue. Having said that, unlike the Samsung crash, it's unclear how we could write a patch to try to reduce the incidence of this... some sort of #pragma no-opt to wrap this code, Android only?

(Dump from capstone for posterity below)

0x0:	bl  #0xffffffffff60efc8
0x4:	cbz  x0, #0x7f8
0x8:	ldp  x9, x8, [sp, #0x88]
0xc:	str  x0, [sp, #0x20]				x0 stored into stack slot; x0 should be non-zero due to preceding branch. 
0x10:	str  x8, [x9]
0x14:	ldp  x9, x8, [sp, #0x58]
0x18:	str  x8, [x9]
0x1c:	ldr  x1, [x20]
0x20:	adrp  x26, #0x3bf8000
0x24:	add  x26, x26, #0xdd0
0x28:	adrp  x24, #0x3bf8000
0x2c:	add  x24, x24, #0xb68
0x30:	ldr  x10, [x1]
0x34:	ldr  x8, [x10]
0x38:	ldr  x9, [x8]
0x3c:	ldr  x8, [sp, #0x20]				x8 loaded from same stack slot
0x40:	cmp  x9, x26
0x44:	b.eq  #0x50
0x48:	cmp  x9, x24
0x4c:	b.ne  #0x888
0x50:	ldrb  w9, [x1, #0x18]
0x54:	lsr  w0, w9, #7
0x58:	lsl  w9, w21, #1
0x5c:	movz  x27, #0x8000, lsl #32
0x60:	movk  x27, #0xfff8, lsl #48
0x64:	bfxil  x9, x0, #0, #1
0x68:	orr  x9, x9, x27
0x6c:	str  x9, [x8, #0x20]				storing into x8, which should be the same value as x1? 
0x70:	ldr  x8, [x20]
0x74:	ldr  x1, [sp, #0x20] 				x1 loaded from stack slot.
0x78:	orr  x9, x8, #0xfffe000000000000
0x7c:	and  x8, x8, #0x7ffffffc0000
0x80:	str  x9, [x1, #0x18]				Crash, x1 is zero
Flags: needinfo?(mgaudet)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on release

For more information, please visit BugBot documentation.

Keywords: topcrash

Musing about this a little bit more: the crash addresses all seem to be 0x18 or 0x28. Matt helpfully provided me with the disassembly for the instructions immediately following the crash:

0x84:	ldr  x0, [x8]
0x88:	cbnz  x0, #0x428
0x8c:	cbz  w23, #0x774
0x90:	ldr  x8, [x22]
0x94:	movz  x9, #0xfffb, lsl #48
0x98:	ldr  x1, [sp, #0x20]             # Load the stack slot again
0x9c:	cmp  x8, x9
0xa0:	str  x8, [x1, #0x28]             # Store at offset 0x28

We never seem to crash at 0x20, even though it's theoretically using the same base pointer. So it appears that sometimes, given preconditions that I assume are set up by the rest of the code in this function, ldr x1, [sp, #0x20] loads a null pointer, even though the value stored at that address is non-null. Two separate instances of that instruction seem to go wrong, so it seems likely that the problem is there, not elsewhere. On the other hand, "loading a value from the stack" is one of the most basic instructions around, so it really shouldn't go wrong like this.

In the short term there's nothing actionable here. I'm going to set a reminder for myself to check the crash rate for 116.2.0 one week from now.

(In reply to Iain Ireland [:iain] from comment #12)

On the other hand, "loading a value from the stack" is one of the most basic instructions around, so it really shouldn't go wrong like this.

The load happens very shortly after the store, which means the stored value likely never actually hit memory: the data is forwarded within the core, possibly from the store queue or even the forwarding network. That's a very complex and very bug-prone path in CPUs. We've already seen errata in Intel and AMD CPUs where store-to-load forwarding messed things up, returning stale data instead of the actual value.

These crashes are all on Snapdragon 820 SoCs which use Kryo cores, which have a few non-public but known errata. Chances are we're hitting one of them or possibly an unknown (or just undocumented) bug.
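To make the shape of that path concrete, a toy illustration (my own example, not tree code): the pattern is just "spill a pointer, reload it shortly afterwards, dereference it". On a correct core the reload must observe the earlier store, typically via forwarding from the store queue; the suspicion here is that this path occasionally hands back stale (null) data.

#include <cstdio>

int main() {
  int value = 42;
  // volatile on the pointer variable keeps the spill and the reload in the
  // generated code instead of being folded into a register.
  int* volatile spill = nullptr;

  spill = &value;         // analogous to: str x0, [sp, #0x20]
  // ... ~26 unrelated instructions here in the real code ...
  int* reloaded = spill;  // analogous to: ldr x1, [sp, #0x20]
  *reloaded = 7;          // analogous to: str x9, [x1, #0x18]; faults if the
                          //   reload came back as null
  printf("%d\n", value);  // prints 7 on correctly behaving hardware
  return 0;
}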

I was thinking about store forwarding, but the important value here is stored at offset 0x0c in the disassembly and not loaded until offset 0x74 (or even 0x98), which is a gap of 26+ instructions and seems like a big window.

Flags: needinfo?(sdetar)

This is a reminder regarding comment #2!

The bug is marked as tracked for firefox116 (release). We have limited time to fix this, the soft freeze is in 10 days. However, the bug still isn't assigned.

Given that we've seen zero (0) crashes from 116.2.0m, I don't think this is going to need work.

Severity: -- → S3
Priority: -- → P3

So is the hypothesis that something (PGO, the phase of the moon, etc) changed the code that we generate very slightly between the previous build and this one, preventing us from running into the CPU bug?

Is there a convenient way for us to dump the equivalent code in 116.2.0, to see whether there are any differences?

(The m was a typo, by the way -- sorry)

Sorry, meant to also say: Yes -- something seems to be preventing us from triggering this, deeply unclear what. It must be possible to get disassembly, but I have to imagine you'd have to grab both APKs and extract libxul from inside.

(In reply to Iain Ireland [:iain] from comment #18)

So is the hypothesis that something (PGO, the phase of the moon, etc) changed the code that we generate very slightly between the previous build and this one, preventing us from running into the CPU bug?

Yes. These bugs are often highly timing-dependent so slight changes in the code make them go away. We respun two builds of 106 to get rid of similar bugs (see this and this).

Is there a convenient way for us to dump the equivalent code in 116.2.0, to see whether there are any differences?

Possible yes, convenient no. Since several functions have been inlined at the point of crash this will all look like a blob within another function.

Thinking out loud: it seems like there are two possibilities. Either we made the same inlining decisions in the good build as the bad build, or we didn't.

In the bad build, we've called functionBindImpl and inlined initReservedSlot/HeapSlot::init. If we can dump the instructions for functionBindImpl, then figuring out what's been inlined is just a matter of effort. If we made different inlining decisions due to PGO, then it's not at all surprising that we don't trigger the bug. If that's the case, maybe we can replace the calls to initReservedSlot with a wrapper that doesn't get inlined on ARM64.
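If we go that route, the wrapper could look roughly like this (a sketch only: the macro and function names are mine, and in-tree it would presumably use MOZ_NEVER_INLINE from mozilla/Attributes.h rather than the raw attribute):

#include <cstdint>

// Sketch: force the slot initialization out of line on the affected
// configuration so PGO/inlining can't fold it back into functionBindImpl.
#if defined(__aarch64__) && defined(__ANDROID__)
#  define BOUNDFUN_NO_INLINE __attribute__((noinline))
#else
#  define BOUNDFUN_NO_INLINE
#endif

// Templated only so this sketch compiles stand-alone; in the tree it would
// take js::NativeObject* and const JS::Value& directly.
template <typename Obj, typename Val>
BOUNDFUN_NO_INLINE void InitReservedSlotOutOfLine(Obj* obj, uint32_t slot,
                                                  const Val& value) {
  obj->initReservedSlot(slot, value);
}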

If we have the same inlining decisions between the two, then dumping the instructions for functionBindImpl will let us compare the good case to the bad case directly.

My understanding is that the symbols are stored separately from the APK, so maybe this is non-trivial.

This is a reminder regarding comment #2!

The bug is marked as tracked for firefox116 (release). We have limited time to fix this, the soft freeze is in 3 days. However, the bug still isn't assigned and has low priority.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash