Closed Bug 1470925 Opened 3 years ago Closed 3 months ago

Crash in _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute


(Firefox for Android :: General, defect)

Firefox 62
Not set



Tracking Status
firefox61 --- affected


(Reporter: philipp, Unassigned)



(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is
report bp-6464c585-dfa0-4143-a3f4-7668e0180625.

Top 10 frames of crashing thread:

0 _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h529bb9ff4d49fc95 src/libcore/
1 rayon_core::registry::WorkerThread::wait_until::hb70f59fb5bf38fdf third_party/rust/rayon-core/src/
2 rayon_core::scope::scope::_$u7b$$u7b$closure$u7d$$u7d$::hebfae7f4f232d14b third_party/rust/rayon-core/src/scope/
3 rayon_core::thread_pool::ThreadPool::install::_$u7b$$u7b$closure$u7d$$u7d$::h1831231b9f6c5ff7 third_party/rust/rayon-core/src/
4 _$LT$rayon_core..job..StackJob$LT$L$C$$u20$F$C$$u20$R$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h1e78b059d9bca3a9 third_party/rust/rayon-core/src/
5 rayon_core::registry::WorkerThread::wait_until::hb70f59fb5bf38fdf third_party/rust/rayon-core/src/
6 std::sys_common::backtrace::__rust_begin_short_backtrace::h762fe3dc5518053a third_party/rust/rayon-core/src/
7 _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h22c1df5e1d933032 src/libstd/thread/
8 std::sys::unix::thread::Thread::new::thread_start::h38f4e2017be83d0f src/liballoc/


this crash signature is popping up in the 61 release build for fennec - it's only occurring on devices with snapdragon 820 cpus.

perhaps it's a build specific issue, like we have seen a number of times in the past with bug 1434994, bug 1454639, bug 1451741 et al.
This is the #5 topcrash on Fennec 61 on release and like our #1 crash, bug 1468541, it's all on Snapdragon 820 CPUs. Don't know if they're directly related, but something sure seems off here.
See Also: → 1468541
:bholley, could you investigate please ?
Flags: needinfo?(bobbyholley)
The crash here is in Rayon, which is the standard threading library in Rust. It's not in style system code per se.

So the two most likely possibilities are that:
(1) Rayon is making some kind of architectural assumption that occasionally breaks on Snapdragon 820s, or
(2) We have another bug elsewhere in the system that is corrupting Rayon's memory.

The crash reports here are remarkably consistent, in that they're always a segfault at 0x40 at [1]. Given the various magic about repr(simd) above, I do wonder whether there's something architecture-specific going wrong with this code. Niko, can you have a look? Feel free to delegate to another expert on this code if needed.

Flags: needinfo?(bobbyholley) → needinfo?(nmatsakis)
We are approaching 4000K crashes on 61 release. Will Niko be able to look at this soon? Thanks.
From irc analysis by eddyb, this is what the code around the crash site looks like:

 1ebff60:       b5f0            push    {r4, r5, r6, r7, lr}
 1ebff62:       af03            add     r7, sp, #12
 1ebff64:       e92d 0f00       stmdb   sp!, {r8, r9, sl, fp}
 1ebff68:       f5ad 7d6f       sub.w   sp, sp, #956    ; 0x3bc
 1ebff6c:       466c            mov     r4, sp
 1ebff6e:       f36f 0404       bfc     r4, #0, #5
 1ebff72:       46a5            mov     sp, r4
 1ebff74:       69c1            ldr     r1, [r0, #28]
 1ebff76:       f04f 0e00       mov.w   lr, #0
 1ebff7a:       9118            str     r1, [sp, #96]   ; 0x60
 1ebff7c:       6981            ldr     r1, [r0, #24]
 1ebff7e:       9117            str     r1, [sp, #92]   ; 0x5c
 1ebff80:       6941            ldr     r1, [r0, #20]
 1ebff82:       9114            str     r1, [sp, #80]   ; 0x50
 1ebff84:       6901            ldr     r1, [r0, #16]
 1ebff86:       9113            str     r1, [sp, #76]   ; 0x4c
 1ebff88:       6881            ldr     r1, [r0, #8]
 1ebff8a:       6805            ldr     r5, [r0, #0]
 1ebff8c:       6844            ldr     r4, [r0, #4]
 1ebff8e:       9112            str     r1, [sp, #72]   ; 0x48
 1ebff90:       2d00            cmp     r5, #0
 1ebff92:       f8d0 900c       ldr.w   r9, [r0, #12]
 1ebff96:       f8c0 e01c       str.w   lr, [r0, #28]
 1ebff9a:       f8c0 e018       str.w   lr, [r0, #24]
 1ebff9e:       f8c0 e014       str.w   lr, [r0, #20]
 1ebffa2:       f8c0 e010       str.w   lr, [r0, #16]
 1ebffa6:       f8c0 e00c       str.w   lr, [r0, #12]
 1ebffaa:       f8c0 e008       str.w   lr, [r0, #8]
 1ebffae:       f8c0 e004       str.w   lr, [r0, #4]
 1ebffb2:       f8c0 e000       str.w   lr, [r0]
 1ebffb6:       6bc1            ldr     r1, [r0, #60]   ; 0x3c
 1ebffb8:       9119            str     r1, [sp, #100]  ; 0x64
 1ebffba:       6b81            ldr     r1, [r0, #56]   ; 0x38
 1ebffbc:       911b            str     r1, [sp, #108]  ; 0x6c
 1ebffbe:       6b41            ldr     r1, [r0, #52]   ; 0x34
 1ebffc0:       911c            str     r1, [sp, #112]  ; 0x70
 1ebffc2:       6b01            ldr     r1, [r0, #48]   ; 0x30
 1ebffc4:       911d            str     r1, [sp, #116]  ; 0x74
 1ebffc6:       6a01            ldr     r1, [r0, #32]
 1ebffc8:       911f            str     r1, [sp, #124]  ; 0x7c
 1ebffca:       6a41            ldr     r1, [r0, #36]   ; 0x24
 1ebffcc:       911e            str     r1, [sp, #120]  ; 0x78
 1ebffce:       6a81            ldr     r1, [r0, #40]   ; 0x28
 1ebffd0:       9116            str     r1, [sp, #88]   ; 0x58
 1ebffd2:       6ac1            ldr     r1, [r0, #44]   ; 0x2c
 1ebffd4:       9115            str     r1, [sp, #84]   ; 0x54
 1ebffd6:       f8c0 e03c       str.w   lr, [r0, #60]   ; 0x3c
 1ebffda:       f8c0 e038       str.w   lr, [r0, #56]   ; 0x38
 1ebffde:       f8c0 e034       str.w   lr, [r0, #52]   ; 0x34
 1ebffe2:       f8c0 e030       str.w   lr, [r0, #48]   ; 0x30
 1ebffe6:       f8c0 e02c       str.w   lr, [r0, #44]   ; 0x2c
 1ebffea:       f8c0 e028       str.w   lr, [r0, #40]   ; 0x28
 1ebffee:       f8c0 e024       str.w   lr, [r0, #36]   ; 0x24
 1ebfff2:       f8c0 e020       str.w   lr, [r0, #32]
 1ebfff6:       f8d0 805c       ldr.w   r8, [r0, #92]   ; 0x5c
 1ebfffa:       6d86            ldr     r6, [r0, #88]   ; 0x58
 1ebfffc:       6d42            ldr     r2, [r0, #84]   ; 0x54
 1ebfffe:       f8d0 c050       ldr.w   ip, [r0, #80]   ; 0x50
 1ec0002:       f8d0 a040       ldr.w   sl, [r0, #64]   ; 0x40
 1ec0006:       f8d0 b044       ldr.w   fp, [r0, #68]   ; 0x44
 1ec000a:       6c81            ldr     r1, [r0, #72]   ; 0x48
 1ec000c:       6cc3            ldr     r3, [r0, #76]   ; 0x4c
 1ec000e:       f8c0 e05c       str.w   lr, [r0, #92]   ; 0x5c
 1ec0012:       f8c0 e058       str.w   lr, [r0, #88]   ; 0x58
 1ec0016:       f8c0 e054       str.w   lr, [r0, #84]   ; 0x54
 1ec001a:       f8c0 e050       str.w   lr, [r0, #80]   ; 0x50
 1ec001e:       f8c0 e04c       str.w   lr, [r0, #76]   ; 0x4c
 1ec0022:       f8c0 e048       str.w   lr, [r0, #72]   ; 0x48
 1ec0026:       f8c0 e044       str.w   lr, [r0, #68]   ; 0x44
 1ec002a:       900e            str     r0, [sp, #56]   ; 0x38
 1ec002c:       f8c0 e040       str.w   lr, [r0, #64]   ; 0x40

The crash site is 1ec0002, so the crashing instruction is:
 1ec0002:       f8d0 a040       ldr.w   sl, [r0, #64]   ; 0x40

with r0 being 0, that's a write to 0x40, which is consistent with the crash address.
The preceding instruction is:
 1ebfffe:       f8d0 c050       ldr.w   ip, [r0, #80]   ; 0x50

There is no way this can have succeeded with the same value of r0, and eddyb says it looks like nothing is jumping there.

It's worth noting that the instruction is a) not aligned, and b) on a page boundary.

One thing that I can imagine going wrong here is if a page fault is happening and r0 is not restored to the right value. But that would be pretty bad and would presumably trigger more crashes than just this.
<nagisa> nah, I’ll just do the armchair debugging for my enjoyment
<nagisa> (I debug such arm crashes professionally for cortex-Ms, and I’d like to think I’ve seen everything :D)
<nagisa> yeah, there’s nothing wrong with that code locally, at least I can’t see anything
<nagisa> In a no-OS/embedded context the first suspect would be an interrupt corrupting its return stack.
<nagisa> but since this is android, my two wild shots are: 1) the thread that was executhing this code was deschedulled and some other thread overwrote parts of the stack of this function 2) a signal handler executed and overwirtten part of the stack for this function
<nagisa> something like that, anyway
snorp is investigating this with eddyb and nagisa on #rustc, NI him here to log their conclusions.
Flags: needinfo?(nmatsakis) → needinfo?(snorp)
I don't think we came to any conclusions other than "looks like stack/register corruption". Still no idea what's happening.
Flags: needinfo?(snorp)
(In reply to James Willcox (:snorp) ( from comment #8)
> I don't think we came to any conclusions other than "looks like
> stack/register corruption". Still no idea what's happening.

There were some very specific investigative observations and conclusions in the IRC log, even if the root cause was still unknown. It seems worth logging those here, or at least linking to them for reference by anyone else who picks this up.
Here's random theory, also after looking at bug 1472526.  It is most likely
a total red herring, but anyway:

This is Thumb-2 code.  Thumb-2 has a conditionalisation model in which the
processor (at least conceptually) has a 4-entry shift register, ITSTATE.
Each entry is a guarding condition for the instruction, a la traditional
ARM32 (EQ, NE, etc).  For each instruction, a condition is shifted out of
the register and used.  Unless the program modifies ITSTATE explicitly, the
vacated spaces are filled with "AL" (always-execute) condition codes.

At the crash site, we expect ITSTATE to be [AL,AL,AL,AL] since there is no
preceding insn which sets it otherwise.

Now I wonder if the presence of the page boundary has caused ITSTATE to
become corrupted somehow.  Either by a hardware bug, or by a kernel bug as a
result of a trip into the kernel to service a fault on the
just-about-to-be-entered code page.  Either way, that might just explain how
the second instruction fails yet the first doesn't fail -- because the
associated corrupted ITSTATE entry causes it not to get executed.
See Also: → 1472526
I think what's going on here has to do with the instruction at 0x1ebfffe, 

> 1ebfffe:       f8d0 c050       ldr.w   ip, [r0, #80]   ; 0x50

If the second part of that instruction is read as 0000 (because of the page boundary), the instruction becomes,

> 1ebfffe:       f8d0 0000       ldr.w   r0, [r0]

We know r0 is likely a valid pointer at this point, but it's reasonable to think that [r0] is 0. So r0 becomes 0 at this point, and the next instruction causes the crash at 0x40,

> 1ec0002:       f8d0 a040       ldr.w   sl, [r0, #64]   ; 0x40

Bug 1472526 can be explained the same way.

However, I have no idea what's causing the second part of the thumb-2 instruction to be read as 0000 when crossing the page boundary.
FWIW, the actual address for 0x1ec0002 in is 0xd0212002 per the crash report, and /proc/self/maps for the crashing process says:

ce66f000-d097c000 r-xp 00107000 fd:01 3326050                            /data/data/org.mozilla.firefox/cache/

So there's nothing weird like something else having mapped an empty page at that address or something.
Re-triaging per

Needinfo :susheel if you think this bug should be re-triaged.
Priority: -- → P5
This bug has accumulated over 33K in crashes on release.
Flags: needinfo?(sdaswani)
Removing my NI until I hear that it's something the Softvision folks should pick up. My guess is the platform team may want to pick this up? (This was the hint that it's a platform issue: "This is Thumb-2 code." :) )
Flags: needinfo?(sdaswani)
Adding a similar signature which showed up with 419 crashes/338 installations.  Here are some comments in those crashes:

* was searching on air bnb when the crashed window opened. 
* reading Yahoo mail. switched to page recommended in mail from Amazon. Crash 
* I tried to open up a donation page for Color of Change. 
* sharing on fb 
* opening a new tab

ni on :dbolter since I don't know who on the platform side could look at this.
Crash Signature: [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute] → [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute] [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h529bb9ff4d49fc95]
Flags: needinfo?(dbolter)
changing ni to :snorp based on channel meeting discussion.
Flags: needinfo?(dbolter) → needinfo?(snorp)
jchen is currently doing the wrangling with Qualcomm on this and other bugs, assigning to him.
Assignee: nobody → nchen
Flags: needinfo?(snorp)
Priority: P5 → --
Similar to Bug 1468541, these signatures appear so far to be gone entirely in 61.0.2 which shipped last week (and
had no related changes).
Assignee: nchen → nobody

Did we get any response from Qualcomm on this? We're seeing an increase in occurrences of bug 1472526.

(In reply to Kevin Jacobs [:kjacobs] from comment #20)

Did we get any response from Qualcomm on this? We're seeing an increase in occurrences of bug 1472526.

Comment 18 mentions jchen was working on this, but he has since left Mozilla. Not sure who, if anyone, followed up with QC regarding this.

I've emailed the Mozilla/Qualcomm mailing list to ask for help with this issue.

Both of these signatures appear to no longer be crashing (the first one only crashes in 61.0). I recommend we close this one out since the signature may have shifted.

Closing because no crashes reported for 12 weeks.

Closed: 3 months ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.