Closed
Bug 1470925
Opened 7 years ago
Closed 5 years ago
Crash in _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute
Categories
(Firefox for Android Graveyard :: General, defect)
Tracking
(firefox61 affected)
RESOLVED
WORKSFORME
| Tracking | Status | |
|---|---|---|
| firefox61 | --- | affected |
People
(Reporter: philipp, Unassigned)
References
Details
(Keywords: crash)
Crash Data
This bug was filed from the Socorro interface and is
report bp-6464c585-dfa0-4143-a3f4-7668e0180625.
=============================================================
Top 10 frames of crashing thread:
0 libxul.so _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h529bb9ff4d49fc95 src/libcore/ptr.rs:223
1 libxul.so rayon_core::registry::WorkerThread::wait_until::hb70f59fb5bf38fdf third_party/rust/rayon-core/src/job.rs:60
2 libxul.so rayon_core::scope::scope::_$u7b$$u7b$closure$u7d$$u7d$::hebfae7f4f232d14b third_party/rust/rayon-core/src/scope/mod.rs:392
3 libxul.so rayon_core::thread_pool::ThreadPool::install::_$u7b$$u7b$closure$u7d$$u7d$::h1831231b9f6c5ff7 third_party/rust/rayon-core/src/registry.rs:713
4 libxul.so _$LT$rayon_core..job..StackJob$LT$L$C$$u20$F$C$$u20$R$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h1e78b059d9bca3a9 third_party/rust/rayon-core/src/registry.rs:355
5 libxul.so rayon_core::registry::WorkerThread::wait_until::hb70f59fb5bf38fdf third_party/rust/rayon-core/src/job.rs:60
6 libxul.so std::sys_common::backtrace::__rust_begin_short_backtrace::h762fe3dc5518053a third_party/rust/rayon-core/src/registry.rs:674
7 libxul.so _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h22c1df5e1d933032 src/libstd/thread/mod.rs:406
8 libxul.so std::sys::unix::thread::Thread::new::thread_start::h38f4e2017be83d0f src/liballoc/boxed.rs:825
9 libc.so libc.so@0x47483
=============================================================
this crash signature is popping up in the 61 release build for fennec - it's only occurring on devices with snapdragon 820 cpus.
perhaps it's a build specific issue, like we have seen a number of times in the past with bug 1434994, bug 1454639, bug 1451741 et al.
Comment 1•7 years ago
|
||
This is the #5 topcrash on Fennec 61 on release and like our #1 crash, bug 1468541, it's all on Snapdragon 820 CPUs. Don't know if they're directly related, but something sure seems off here.
See Also: → 1468541
Comment 3•7 years ago
|
||
The crash here is in Rayon, which is the standard threading library in Rust. It's not in style system code per se.
So the two most likely possibilities are that:
(1) Rayon is making some kind of architectural assumption that occasionally breaks on Snapdragon 820s, or
(2) We have another bug elsewhere in the system that is corrupting Rayon's memory.
The crash reports here are remarkably consistent, in that they're always a segfault at 0x40 at [1]. Given the various magic about repr(simd) above, I do wonder whether there's something architecture-specific going wrong with this code. Niko, can you have a look? Feel free to delegate to another expert on this code if needed.
[1] https://github.com/rust-lang/rust/blob/4d90ac38c0b61bb69470b61ea2cccea0df48d9e5/src/libcore/ptr.rs#L223
Flags: needinfo?(bobbyholley) → needinfo?(nmatsakis)
Comment 4•7 years ago
|
||
We are approaching 4000K crashes on 61 release. Will Niko be able to look at this soon? Thanks.
Comment 5•7 years ago
|
||
From irc analysis by eddyb, this is what the code around the crash site looks like:
1ebff60: b5f0 push {r4, r5, r6, r7, lr}
1ebff62: af03 add r7, sp, #12
1ebff64: e92d 0f00 stmdb sp!, {r8, r9, sl, fp}
1ebff68: f5ad 7d6f sub.w sp, sp, #956 ; 0x3bc
1ebff6c: 466c mov r4, sp
1ebff6e: f36f 0404 bfc r4, #0, #5
1ebff72: 46a5 mov sp, r4
1ebff74: 69c1 ldr r1, [r0, #28]
1ebff76: f04f 0e00 mov.w lr, #0
1ebff7a: 9118 str r1, [sp, #96] ; 0x60
1ebff7c: 6981 ldr r1, [r0, #24]
1ebff7e: 9117 str r1, [sp, #92] ; 0x5c
1ebff80: 6941 ldr r1, [r0, #20]
1ebff82: 9114 str r1, [sp, #80] ; 0x50
1ebff84: 6901 ldr r1, [r0, #16]
1ebff86: 9113 str r1, [sp, #76] ; 0x4c
1ebff88: 6881 ldr r1, [r0, #8]
1ebff8a: 6805 ldr r5, [r0, #0]
1ebff8c: 6844 ldr r4, [r0, #4]
1ebff8e: 9112 str r1, [sp, #72] ; 0x48
1ebff90: 2d00 cmp r5, #0
1ebff92: f8d0 900c ldr.w r9, [r0, #12]
1ebff96: f8c0 e01c str.w lr, [r0, #28]
1ebff9a: f8c0 e018 str.w lr, [r0, #24]
1ebff9e: f8c0 e014 str.w lr, [r0, #20]
1ebffa2: f8c0 e010 str.w lr, [r0, #16]
1ebffa6: f8c0 e00c str.w lr, [r0, #12]
1ebffaa: f8c0 e008 str.w lr, [r0, #8]
1ebffae: f8c0 e004 str.w lr, [r0, #4]
1ebffb2: f8c0 e000 str.w lr, [r0]
1ebffb6: 6bc1 ldr r1, [r0, #60] ; 0x3c
1ebffb8: 9119 str r1, [sp, #100] ; 0x64
1ebffba: 6b81 ldr r1, [r0, #56] ; 0x38
1ebffbc: 911b str r1, [sp, #108] ; 0x6c
1ebffbe: 6b41 ldr r1, [r0, #52] ; 0x34
1ebffc0: 911c str r1, [sp, #112] ; 0x70
1ebffc2: 6b01 ldr r1, [r0, #48] ; 0x30
1ebffc4: 911d str r1, [sp, #116] ; 0x74
1ebffc6: 6a01 ldr r1, [r0, #32]
1ebffc8: 911f str r1, [sp, #124] ; 0x7c
1ebffca: 6a41 ldr r1, [r0, #36] ; 0x24
1ebffcc: 911e str r1, [sp, #120] ; 0x78
1ebffce: 6a81 ldr r1, [r0, #40] ; 0x28
1ebffd0: 9116 str r1, [sp, #88] ; 0x58
1ebffd2: 6ac1 ldr r1, [r0, #44] ; 0x2c
1ebffd4: 9115 str r1, [sp, #84] ; 0x54
1ebffd6: f8c0 e03c str.w lr, [r0, #60] ; 0x3c
1ebffda: f8c0 e038 str.w lr, [r0, #56] ; 0x38
1ebffde: f8c0 e034 str.w lr, [r0, #52] ; 0x34
1ebffe2: f8c0 e030 str.w lr, [r0, #48] ; 0x30
1ebffe6: f8c0 e02c str.w lr, [r0, #44] ; 0x2c
1ebffea: f8c0 e028 str.w lr, [r0, #40] ; 0x28
1ebffee: f8c0 e024 str.w lr, [r0, #36] ; 0x24
1ebfff2: f8c0 e020 str.w lr, [r0, #32]
1ebfff6: f8d0 805c ldr.w r8, [r0, #92] ; 0x5c
1ebfffa: 6d86 ldr r6, [r0, #88] ; 0x58
1ebfffc: 6d42 ldr r2, [r0, #84] ; 0x54
1ebfffe: f8d0 c050 ldr.w ip, [r0, #80] ; 0x50
1ec0002: f8d0 a040 ldr.w sl, [r0, #64] ; 0x40
1ec0006: f8d0 b044 ldr.w fp, [r0, #68] ; 0x44
1ec000a: 6c81 ldr r1, [r0, #72] ; 0x48
1ec000c: 6cc3 ldr r3, [r0, #76] ; 0x4c
1ec000e: f8c0 e05c str.w lr, [r0, #92] ; 0x5c
1ec0012: f8c0 e058 str.w lr, [r0, #88] ; 0x58
1ec0016: f8c0 e054 str.w lr, [r0, #84] ; 0x54
1ec001a: f8c0 e050 str.w lr, [r0, #80] ; 0x50
1ec001e: f8c0 e04c str.w lr, [r0, #76] ; 0x4c
1ec0022: f8c0 e048 str.w lr, [r0, #72] ; 0x48
1ec0026: f8c0 e044 str.w lr, [r0, #68] ; 0x44
1ec002a: 900e str r0, [sp, #56] ; 0x38
1ec002c: f8c0 e040 str.w lr, [r0, #64] ; 0x40
The crash site is 1ec0002, so the crashing instruction is:
1ec0002: f8d0 a040 ldr.w sl, [r0, #64] ; 0x40
with r0 being 0, that's a write to 0x40, which is consistent with the crash address.
The preceding instruction is:
1ebfffe: f8d0 c050 ldr.w ip, [r0, #80] ; 0x50
There is no way this can have succeeded with the same value of r0, and eddyb says it looks like nothing is jumping there.
It's worth noting that the instruction is a) not aligned, and b) on a page boundary.
One thing that I can imagine going wrong here is if a page fault is happening and r0 is not restored to the right value. But that would be pretty bad and would presumably trigger more crashes than just this.
Comment 6•7 years ago
|
||
<nagisa> nah, I’ll just do the armchair debugging for my enjoyment
<nagisa> (I debug such arm crashes professionally for cortex-Ms, and I’d like to think I’ve seen everything :D)
...
<nagisa> yeah, there’s nothing wrong with that code locally, at least I can’t see anything
<nagisa> In a no-OS/embedded context the first suspect would be an interrupt corrupting its return stack.
<nagisa> but since this is android, my two wild shots are: 1) the thread that was executhing this code was deschedulled and some other thread overwrote parts of the stack of this function 2) a signal handler executed and overwirtten part of the stack for this function
<nagisa> something like that, anyway
Comment 7•7 years ago
|
||
snorp is investigating this with eddyb and nagisa on #rustc, NI him here to log their conclusions.
Flags: needinfo?(nmatsakis) → needinfo?(snorp)
I don't think we came to any conclusions other than "looks like stack/register corruption". Still no idea what's happening.
Flags: needinfo?(snorp)
Comment 9•7 years ago
|
||
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8)
> I don't think we came to any conclusions other than "looks like
> stack/register corruption". Still no idea what's happening.
There were some very specific investigative observations and conclusions in the IRC log, even if the root cause was still unknown. It seems worth logging those here, or at least linking to them for reference by anyone else who picks this up.
Comment 10•7 years ago
|
||
Here's random theory, also after looking at bug 1472526. It is most likely
a total red herring, but anyway:
This is Thumb-2 code. Thumb-2 has a conditionalisation model in which the
processor (at least conceptually) has a 4-entry shift register, ITSTATE.
Each entry is a guarding condition for the instruction, a la traditional
ARM32 (EQ, NE, etc). For each instruction, a condition is shifted out of
the register and used. Unless the program modifies ITSTATE explicitly, the
vacated spaces are filled with "AL" (always-execute) condition codes.
At the crash site, we expect ITSTATE to be [AL,AL,AL,AL] since there is no
preceding insn which sets it otherwise.
Now I wonder if the presence of the page boundary has caused ITSTATE to
become corrupted somehow. Either by a hardware bug, or by a kernel bug as a
result of a trip into the kernel to service a fault on the
just-about-to-be-entered code page. Either way, that might just explain how
the second instruction fails yet the first doesn't fail -- because the
associated corrupted ITSTATE entry causes it not to get executed.
Comment 11•7 years ago
|
||
I think what's going on here has to do with the instruction at 0x1ebfffe,
> 1ebfffe: f8d0 c050 ldr.w ip, [r0, #80] ; 0x50
If the second part of that instruction is read as 0000 (because of the page boundary), the instruction becomes,
> 1ebfffe: f8d0 0000 ldr.w r0, [r0]
We know r0 is likely a valid pointer at this point, but it's reasonable to think that [r0] is 0. So r0 becomes 0 at this point, and the next instruction causes the crash at 0x40,
> 1ec0002: f8d0 a040 ldr.w sl, [r0, #64] ; 0x40
Bug 1472526 can be explained the same way.
However, I have no idea what's causing the second part of the thumb-2 instruction to be read as 0000 when crossing the page boundary.
Comment 12•7 years ago
|
||
FWIW, the actual address for 0x1ec0002 in libxul.so is 0xd0212002 per the crash report, and /proc/self/maps for the crashing process says:
ce66f000-d097c000 r-xp 00107000 fd:01 3326050 /data/data/org.mozilla.firefox/cache/libxul.so
So there's nothing weird like something else having mapped an empty page at that address or something.
Comment 13•7 years ago
|
||
Re-triaging per https://bugzilla.mozilla.org/show_bug.cgi?id=1473195
Needinfo :susheel if you think this bug should be re-triaged.
Priority: -- → P5
Comment 14•7 years ago
|
||
This bug has accumulated over 33K in crashes on release.
Flags: needinfo?(sdaswani)
Comment 15•7 years ago
|
||
Removing my NI until I hear that it's something the Softvision folks should pick up. My guess is the platform team may want to pick this up? (This was the hint that it's a platform issue: "This is Thumb-2 code." :) )
Flags: needinfo?(sdaswani)
Comment 16•7 years ago
|
||
Adding a similar signature which showed up with 419 crashes/338 installations. Here are some comments in those crashes:
* was searching on air bnb when the crashed window opened.
* reading Yahoo mail. switched to page recommended in mail from Amazon. Crash
* I tried to open up a donation page for Color of Change.
* sharing on fb
* opening a new tab
ni on :dbolter since I don't know who on the platform side could look at this.
Crash Signature: [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute] → [@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute]
[@ _$LT$rayon_core..job..HeapJob$LT$BODY$GT$$u20$as$u20$rayon_core..job..Job$GT$::execute::h529bb9ff4d49fc95]
Flags: needinfo?(dbolter)
Comment 17•7 years ago
|
||
changing ni to :snorp based on channel meeting discussion.
Flags: needinfo?(dbolter) → needinfo?(snorp)
jchen is currently doing the wrangling with Qualcomm on this and other bugs, assigning to him.
Assignee: nobody → nchen
Flags: needinfo?(snorp)
Updated•7 years ago
|
Priority: P5 → --
Comment 19•7 years ago
|
||
Similar to Bug 1468541, these signatures appear so far to be gone entirely in 61.0.2 which shipped last week (and
had no related changes).
Updated•7 years ago
|
Assignee: nchen → nobody
Comment 20•6 years ago
|
||
Did we get any response from Qualcomm on this? We're seeing an increase in occurrences of bug 1472526.
Comment 21•6 years ago
|
||
(In reply to Kevin Jacobs [:kjacobs] from comment #20)
Did we get any response from Qualcomm on this? We're seeing an increase in occurrences of bug 1472526.
Comment 18 mentions jchen was working on this, but he has since left Mozilla. Not sure who, if anyone, followed up with QC regarding this.
Comment 22•6 years ago
|
||
I've emailed the Mozilla/Qualcomm mailing list to ask for help with this issue.
Comment 23•6 years ago
|
||
Both of these signatures appear to no longer be crashing (the first one only crashes in 61.0). I recommend we close this one out since the signature may have shifted.
Comment 24•5 years ago
|
||
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
| Assignee | ||
Updated•5 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•