Closed Bug 1470925 Opened 7 years ago Closed 5 years ago

Crash in _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute

Categories

(Firefox for Android Graveyard :: General, defect)

Firefox 62
ARM
Android
defect
Not set
critical

Tracking

(firefox61 affected)

RESOLVED WORKSFORME
Tracking Status
firefox61 --- affected

People

(Reporter: philipp, Unassigned)

References

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is report bp-6464c585-dfa0-4143-a3f4-7668e0180625. ============================================================= Top 10 frames of crashing thread: 0 libxul.so _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h529bb9ff4d49fc95 src/libcore/ptr.rs:223 1 libxul.so rayon_core::registry::WorkerThread::wait_until::hb70f59fb5bf38fdf third_party/rust/rayon-core/src/job.rs:60 2 libxul.so rayon_core::scope::scope::_$u7b$$u7b$closure$u7d$$u7d$::hebfae7f4f232d14b third_party/rust/rayon-core/src/scope/mod.rs:392 3 libxul.so rayon_core::thread_pool::ThreadPool::install::_$u7b$$u7b$closure$u7d$$u7d$::h1831231b9f6c5ff7 third_party/rust/rayon-core/src/registry.rs:713 4 libxul.so _$LT$rayon_core..job..StackJob$LT$L$C$$u20$F$C$$u20$R$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h1e78b059d9bca3a9 third_party/rust/rayon-core/src/registry.rs:355 5 libxul.so rayon_core::registry::WorkerThread::wait_until::hb70f59fb5bf38fdf third_party/rust/rayon-core/src/job.rs:60 6 libxul.so std::sys_common::backtrace::__rust_begin_short_backtrace::h762fe3dc5518053a third_party/rust/rayon-core/src/registry.rs:674 7 libxul.so _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h22c1df5e1d933032 src/libstd/thread/mod.rs:406 8 libxul.so std::sys::unix::thread::Thread::new::thread_start::h38f4e2017be83d0f src/liballoc/boxed.rs:825 9 libc.so libc.so@0x47483 ============================================================= this crash signature is popping up in the 61 release build for fennec - it's only occurring on devices with snapdragon 820 cpus. perhaps it's a build specific issue, like we have seen a number of times in the past with bug 1434994, bug 1454639, bug 1451741 et al.
This is the #5 topcrash on Fennec 61 on release and like our #1 crash, bug 1468541, it's all on Snapdragon 820 CPUs. Don't know if they're directly related, but something sure seems off here.
See Also: → 1468541
:bholley, could you investigate please ?
Flags: needinfo?(bobbyholley)
The crash here is in Rayon, which is the standard threading library in Rust. It's not in style system code per se. So the two most likely possibilities are that: (1) Rayon is making some kind of architectural assumption that occasionally breaks on Snapdragon 820s, or (2) We have another bug elsewhere in the system that is corrupting Rayon's memory. The crash reports here are remarkably consistent, in that they're always a segfault at 0x40 at [1]. Given the various magic about repr(simd) above, I do wonder whether there's something architecture-specific going wrong with this code. Niko, can you have a look? Feel free to delegate to another expert on this code if needed. [1] https://github.com/rust-lang/rust/blob/4d90ac38c0b61bb69470b61ea2cccea0df48d9e5/src/libcore/ptr.rs#L223
Flags: needinfo?(bobbyholley) → needinfo?(nmatsakis)
We are approaching 4000K crashes on 61 release. Will Niko be able to look at this soon? Thanks.
From irc analysis by eddyb, this is what the code around the crash site looks like: 1ebff60: b5f0 push {r4, r5, r6, r7, lr} 1ebff62: af03 add r7, sp, #12 1ebff64: e92d 0f00 stmdb sp!, {r8, r9, sl, fp} 1ebff68: f5ad 7d6f sub.w sp, sp, #956 ; 0x3bc 1ebff6c: 466c mov r4, sp 1ebff6e: f36f 0404 bfc r4, #0, #5 1ebff72: 46a5 mov sp, r4 1ebff74: 69c1 ldr r1, [r0, #28] 1ebff76: f04f 0e00 mov.w lr, #0 1ebff7a: 9118 str r1, [sp, #96] ; 0x60 1ebff7c: 6981 ldr r1, [r0, #24] 1ebff7e: 9117 str r1, [sp, #92] ; 0x5c 1ebff80: 6941 ldr r1, [r0, #20] 1ebff82: 9114 str r1, [sp, #80] ; 0x50 1ebff84: 6901 ldr r1, [r0, #16] 1ebff86: 9113 str r1, [sp, #76] ; 0x4c 1ebff88: 6881 ldr r1, [r0, #8] 1ebff8a: 6805 ldr r5, [r0, #0] 1ebff8c: 6844 ldr r4, [r0, #4] 1ebff8e: 9112 str r1, [sp, #72] ; 0x48 1ebff90: 2d00 cmp r5, #0 1ebff92: f8d0 900c ldr.w r9, [r0, #12] 1ebff96: f8c0 e01c str.w lr, [r0, #28] 1ebff9a: f8c0 e018 str.w lr, [r0, #24] 1ebff9e: f8c0 e014 str.w lr, [r0, #20] 1ebffa2: f8c0 e010 str.w lr, [r0, #16] 1ebffa6: f8c0 e00c str.w lr, [r0, #12] 1ebffaa: f8c0 e008 str.w lr, [r0, #8] 1ebffae: f8c0 e004 str.w lr, [r0, #4] 1ebffb2: f8c0 e000 str.w lr, [r0] 1ebffb6: 6bc1 ldr r1, [r0, #60] ; 0x3c 1ebffb8: 9119 str r1, [sp, #100] ; 0x64 1ebffba: 6b81 ldr r1, [r0, #56] ; 0x38 1ebffbc: 911b str r1, [sp, #108] ; 0x6c 1ebffbe: 6b41 ldr r1, [r0, #52] ; 0x34 1ebffc0: 911c str r1, [sp, #112] ; 0x70 1ebffc2: 6b01 ldr r1, [r0, #48] ; 0x30 1ebffc4: 911d str r1, [sp, #116] ; 0x74 1ebffc6: 6a01 ldr r1, [r0, #32] 1ebffc8: 911f str r1, [sp, #124] ; 0x7c 1ebffca: 6a41 ldr r1, [r0, #36] ; 0x24 1ebffcc: 911e str r1, [sp, #120] ; 0x78 1ebffce: 6a81 ldr r1, [r0, #40] ; 0x28 1ebffd0: 9116 str r1, [sp, #88] ; 0x58 1ebffd2: 6ac1 ldr r1, [r0, #44] ; 0x2c 1ebffd4: 9115 str r1, [sp, #84] ; 0x54 1ebffd6: f8c0 e03c str.w lr, [r0, #60] ; 0x3c 1ebffda: f8c0 e038 str.w lr, [r0, #56] ; 0x38 1ebffde: f8c0 e034 str.w lr, [r0, #52] ; 0x34 1ebffe2: f8c0 e030 str.w lr, [r0, #48] ; 0x30 1ebffe6: f8c0 e02c str.w lr, [r0, #44] ; 0x2c 1ebffea: f8c0 e028 str.w lr, [r0, #40] ; 0x28 1ebffee: f8c0 e024 str.w lr, [r0, #36] ; 0x24 1ebfff2: f8c0 e020 str.w lr, [r0, #32] 1ebfff6: f8d0 805c ldr.w r8, [r0, #92] ; 0x5c 1ebfffa: 6d86 ldr r6, [r0, #88] ; 0x58 1ebfffc: 6d42 ldr r2, [r0, #84] ; 0x54 1ebfffe: f8d0 c050 ldr.w ip, [r0, #80] ; 0x50 1ec0002: f8d0 a040 ldr.w sl, [r0, #64] ; 0x40 1ec0006: f8d0 b044 ldr.w fp, [r0, #68] ; 0x44 1ec000a: 6c81 ldr r1, [r0, #72] ; 0x48 1ec000c: 6cc3 ldr r3, [r0, #76] ; 0x4c 1ec000e: f8c0 e05c str.w lr, [r0, #92] ; 0x5c 1ec0012: f8c0 e058 str.w lr, [r0, #88] ; 0x58 1ec0016: f8c0 e054 str.w lr, [r0, #84] ; 0x54 1ec001a: f8c0 e050 str.w lr, [r0, #80] ; 0x50 1ec001e: f8c0 e04c str.w lr, [r0, #76] ; 0x4c 1ec0022: f8c0 e048 str.w lr, [r0, #72] ; 0x48 1ec0026: f8c0 e044 str.w lr, [r0, #68] ; 0x44 1ec002a: 900e str r0, [sp, #56] ; 0x38 1ec002c: f8c0 e040 str.w lr, [r0, #64] ; 0x40 The crash site is 1ec0002, so the crashing instruction is: 1ec0002: f8d0 a040 ldr.w sl, [r0, #64] ; 0x40 with r0 being 0, that's a write to 0x40, which is consistent with the crash address. The preceding instruction is: 1ebfffe: f8d0 c050 ldr.w ip, [r0, #80] ; 0x50 There is no way this can have succeeded with the same value of r0, and eddyb says it looks like nothing is jumping there. It's worth noting that the instruction is a) not aligned, and b) on a page boundary. One thing that I can imagine going wrong here is if a page fault is happening and r0 is not restored to the right value. But that would be pretty bad and would presumably trigger more crashes than just this.
<nagisa> nah, I’ll just do the armchair debugging for my enjoyment <nagisa> (I debug such arm crashes professionally for cortex-Ms, and I’d like to think I’ve seen everything :D) ... <nagisa> yeah, there’s nothing wrong with that code locally, at least I can’t see anything <nagisa> In a no-OS/embedded context the first suspect would be an interrupt corrupting its return stack. <nagisa> but since this is android, my two wild shots are: 1) the thread that was executhing this code was deschedulled and some other thread overwrote parts of the stack of this function 2) a signal handler executed and overwirtten part of the stack for this function <nagisa> something like that, anyway
snorp is investigating this with eddyb and nagisa on #rustc, NI him here to log their conclusions.
Flags: needinfo?(nmatsakis) → needinfo?(snorp)
I don't think we came to any conclusions other than "looks like stack/register corruption". Still no idea what's happening.
Flags: needinfo?(snorp)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8) > I don't think we came to any conclusions other than "looks like > stack/register corruption". Still no idea what's happening. There were some very specific investigative observations and conclusions in the IRC log, even if the root cause was still unknown. It seems worth logging those here, or at least linking to them for reference by anyone else who picks this up.
Here's random theory, also after looking at bug 1472526. It is most likely a total red herring, but anyway: This is Thumb-2 code. Thumb-2 has a conditionalisation model in which the processor (at least conceptually) has a 4-entry shift register, ITSTATE. Each entry is a guarding condition for the instruction, a la traditional ARM32 (EQ, NE, etc). For each instruction, a condition is shifted out of the register and used. Unless the program modifies ITSTATE explicitly, the vacated spaces are filled with "AL" (always-execute) condition codes. At the crash site, we expect ITSTATE to be [AL,AL,AL,AL] since there is no preceding insn which sets it otherwise. Now I wonder if the presence of the page boundary has caused ITSTATE to become corrupted somehow. Either by a hardware bug, or by a kernel bug as a result of a trip into the kernel to service a fault on the just-about-to-be-entered code page. Either way, that might just explain how the second instruction fails yet the first doesn't fail -- because the associated corrupted ITSTATE entry causes it not to get executed.
I think what's going on here has to do with the instruction at 0x1ebfffe, > 1ebfffe: f8d0 c050 ldr.w ip, [r0, #80] ; 0x50 If the second part of that instruction is read as 0000 (because of the page boundary), the instruction becomes, > 1ebfffe: f8d0 0000 ldr.w r0, [r0] We know r0 is likely a valid pointer at this point, but it's reasonable to think that [r0] is 0. So r0 becomes 0 at this point, and the next instruction causes the crash at 0x40, > 1ec0002: f8d0 a040 ldr.w sl, [r0, #64] ; 0x40 Bug 1472526 can be explained the same way. However, I have no idea what's causing the second part of the thumb-2 instruction to be read as 0000 when crossing the page boundary.
FWIW, the actual address for 0x1ec0002 in libxul.so is 0xd0212002 per the crash report, and /proc/self/maps for the crashing process says: ce66f000-d097c000 r-xp 00107000 fd:01 3326050 /data/data/org.mozilla.firefox/cache/libxul.so So there's nothing weird like something else having mapped an empty page at that address or something.
Re-triaging per https://bugzilla.mozilla.org/show_bug.cgi?id=1473195 Needinfo :susheel if you think this bug should be re-triaged.
Priority: -- → P5
This bug has accumulated over 33K in crashes on release.
Flags: needinfo?(sdaswani)
Removing my NI until I hear that it's something the Softvision folks should pick up. My guess is the platform team may want to pick this up? (This was the hint that it's a platform issue: "This is Thumb-2 code." :) )
Flags: needinfo?(sdaswani)
Adding a similar signature which showed up with 419 crashes/338 installations. Here are some comments in those crashes: * was searching on air bnb when the crashed window opened. * reading Yahoo mail. switched to page recommended in mail from Amazon. Crash * I tried to open up a donation page for Color of Change. * sharing on fb * opening a new tab ni on :dbolter since I don't know who on the platform side could look at this.
Crash Signature: [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute] → [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute] [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h529bb9ff4d49fc95]
Flags: needinfo?(dbolter)
changing ni to :snorp based on channel meeting discussion.
Flags: needinfo?(dbolter) → needinfo?(snorp)
jchen is currently doing the wrangling with Qualcomm on this and other bugs, assigning to him.
Assignee: nobody → nchen
Flags: needinfo?(snorp)
Priority: P5 → --
Similar to Bug 1468541, these signatures appear so far to be gone entirely in 61.0.2 which shipped last week (and had no related changes).
Assignee: nchen → nobody

Did we get any response from Qualcomm on this? We're seeing an increase in occurrences of bug 1472526.

(In reply to Kevin Jacobs [:kjacobs] from comment #20)

Did we get any response from Qualcomm on this? We're seeing an increase in occurrences of bug 1472526.

Comment 18 mentions jchen was working on this, but he has since left Mozilla. Not sure who, if anyone, followed up with QC regarding this.

I've emailed the Mozilla/Qualcomm mailing list to ask for help with this issue.

Both of these signatures appear to no longer be crashing (the first one only crashes in 61.0). I recommend we close this one out since the signature may have shifted.

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.