[ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Xperia X/Xperia Z5/Galaxy S6/Galaxy S2 Tab)
Categories
(Core :: JavaScript Engine: JIT, defect, P3)
Tracking
Release | Tracking | Status |
---|---|---|
firefox-esr60 | --- | wontfix |
firefox-esr68 | - | wontfix |
firefox-esr102 | --- | affected |
firefox60 | --- | wontfix |
firefox61 | --- | wontfix |
firefox62 | --- | wontfix |
firefox63 | --- | wontfix |
firefox65 | --- | wontfix |
firefox66 | --- | wontfix |
firefox67 | --- | wontfix |
firefox67.0.1 | --- | wontfix |
firefox68 | --- | wontfix |
firefox69 | --- | wontfix |
firefox70 | --- | wontfix |
firefox76 | --- | wontfix |
firefox77 | --- | wontfix |
firefox78 | --- | wontfix |
firefox111 | --- | affected |
firefox112 | --- | affected |
firefox113 | --- | affected |
People
(Reporter: tcampbell, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, stalled)
Crash Data
Attachments
(2 files, 1 obsolete file)
Reporter
Comment 8•6 years ago
Kannan did some analysis of some arm64 crashdumps in https://bugzilla.mozilla.org/show_bug.cgi?id=1550525#c16
Reporter
Comment 10•6 years ago
This issue is showing up under a few other signatures due to how the dumps are processed. We are seeing this high Sony <-> SIGILL correlation on both 32-bit and 64-bit ARM.
The most commonly affected boards (which are primarily, but not exclusively, used by Sony) are MSM8952/MSM8956/MSM8994.
Note: Crashes for model F5321 list the board as MSM8952, while the product info I can find suggests it should be MSM8956 (hexa-core).
These Qualcomm Snapdragon devices are both in big.LITTLE 2+4 or 4+4 configurations. The little cores are Cortex-A53 cores and the big cores are either Cortex-A53 or A72.
In Bug 1521158, we tried to address some cache invalidation issues, but it doesn't seem to have fixed everything.
Comment 11•6 years ago
This continues to be the #2 top crash in Fennec 68 release (js::jit::MaybeEnterJit signature). Besides some Sony devices, Samsung Galaxy Tab S2 is one of the top crashing devices.
Comment 12•6 years ago
I think we should revisit this bug to see if we can do anything to help the crashes. Would it help to do some outreach to Sony?
The SprintfLiteral signature alone has now exceeded 42502 crashes on release. One of the JIT signatures has over 16K in 68 release. Considering we only collect a percentage of crashes, I feel this is quite high.
Reporter
Comment 13•6 years ago
The feeling is this is still some complex CPU interaction. I'm doing some processing of crashes today to try and identify any particular patterns.
As of today the crashes are 95+% ARM64, which does not match my recollection of what it used to be. I'm attributing this to us now shipping ARM64 Fennec binaries on the Play Store and users automatically being migrated over.
One guess we could try is to apply the existing Samsung Galaxy S6 workaround to these devices. That involved flushing the icache twice and did improve crash rate for those devices. This would be more of a shot in the dark than anything else. https://searchfox.org/mozilla-central/rev/9775cca0a10a9b5c5f4e15c8f7b3eff5bf91bbd0/js/src/jit/arm/Architecture-arm.cpp#256-262
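For readers unfamiliar with the workaround, here is a minimal sketch of what "flushing the icache twice" could look like. It is not the code at the searchfox link above. `FlushICacheTwice` is a hypothetical name, and the sketch assumes a GCC or Clang toolchain, where `__builtin___clear_cache` is available.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only; not SpiderMonkey's actual function.
// Assumes GCC or Clang, which provide __builtin___clear_cache.
inline void FlushICacheTwice(void* code, std::size_t size) {
  char* begin = static_cast<char*>(code);
  char* end = begin + size;
  // First pass: the normal data-cache clean plus instruction-cache invalidate
  // for the range [begin, end).
  __builtin___clear_cache(begin, end);
  // Second pass: repeat the same maintenance over the same range. The idea of
  // the workaround is that a repeat may catch a flush that did not take full
  // effect on every core.
  __builtin___clear_cache(begin, end);
}
```

A caller would invoke this once after writing or patching a region of JIT code and before jumping into it. On architectures with coherent instruction caches the builtin typically compiles to nothing, so the second call costs little there.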
Reporter
Comment 14•6 years ago
Here is the faulting instruction for a sample of 100 FennecAndroid SIGILL crashes last week. Overall, the majority of the code looks well formed and similar to the type of code we generate. This instruction data is what the minidump captures from memory after the crash, and it strongly suggests an icache synchronization issue: the CPU raised an illegal-instruction fault even though the instruction in memory is valid, so the CPU disagrees with memory.
Samples such as |cmp w2, #0x3e8| indicate at least some of these crashes are in Baseline mainline code (warm-to-ion check) which is our simplest code generation process and is single threaded.
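To make the sample concrete: `0x3e8` is 1000 in decimal, and `cmp w2, #0x3e8` is a perfectly valid AArch64 instruction, so a SIGILL reported at such an address is surprising. The sketch below encodes and recognises that instruction form, based on the architectural definition of `CMP (immediate)` as an alias of `SUBS WZR, Wn, #imm12`. The function names are hypothetical and are not part of any crash-analysis tooling mentioned in this bug.

```cpp
#include <cstdint>

// CMP Wn, #imm12 (32-bit, unshifted) is SUBS WZR, Wn, #imm12.
// Fixed bits: sf=0, op=1, S=1, 100010, sh=0, and Rd=31 (WZR).
constexpr std::uint32_t kCmpImm32Base = 0x7100001Fu;
// Mask covering the fixed bits: bits 31..22 and the Rd field (bits 4..0).
constexpr std::uint32_t kCmpImm32Mask = 0xFFC0001Fu;

// Hypothetical helper: build the instruction word for `cmp w<rn>, #imm12`.
constexpr std::uint32_t EncodeCmpImm32(std::uint32_t rn, std::uint32_t imm12) {
  return kCmpImm32Base | ((imm12 & 0xFFFu) << 10) | ((rn & 0x1Fu) << 5);
}

// Hypothetical helper: does this word have the `cmp wN, #imm12` form?
constexpr bool IsCmpImm32(std::uint32_t word) {
  return (word & kCmpImm32Mask) == kCmpImm32Base;
}

// Hypothetical helper: extract the 12-bit immediate from such a word.
constexpr std::uint32_t CmpImm32Immediate(std::uint32_t word) {
  return (word >> 10) & 0xFFFu;
}
```

A check like `IsCmpImm32(word)` on the faulting word read from a minidump would confirm that memory holds a legal instruction, which is the kind of evidence that points at a stale instruction cache rather than bad code generation.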
Reporter
Comment 15•6 years ago
Of that sample set, 90+% of crashes are big/little configurations with either 6 (2+4) or 8 (4+4) cores.
Top crashing chipsets are:
MSM8956 (6-core, Qualcomm, used in Sony devices and others)
MSM8994 (8-core, Qualcomm, used in Sony devices and others)
Exynos7420 (Samsung. Previously a big source of icache-related crashes. We now double-flush, which helped with the numbers)
If I look at non-SIGILL crashes, I don't see any of these in even the top-15 crashing chipsets. https://crash-stats.mozilla.com/search/?reason=%21%3DSIGILL%20%2F%20ILL_ILLOPC&product=FennecAndroid&version=68.0&date=%3E%3D2019-08-05T15%3A32%3A00.000Z&date=%3C2019-08-12T15%3A32%3A00.000Z&_facets=android_board&_sort=-date#facet-android_board
Reporter
Comment 16•6 years ago
(Adding signature that some of these crashes manifested as in 67)
Reporter
Comment 17•6 years ago
Querying crashstats API for all FennecAndroid SIGILL crashes -- to avoid concerns about bad signatures -- shows the crash rate doubles for FF68. This increase in crash rate is due to these same unstable phones migrating from ARM32 to ARM64 builds as we officially released ARM64 FF68 to Play Store.
Comment 19•6 years ago
Adding [fennec68?] whiteboard tag so we track this top crash for Fennec ESR 68.
Reporter
Comment 20•6 years ago
Updated graph that breaks the crashes down by the chipsets we are interested in. This is a stacked graph. With the release of FF68 (and the switch to ARM64 builds), there was a slight jump in Exynos7420 (Galaxy S6) crashes and a huge jump in MSM8994/8956 (Xperia Z5, Xperia X) crashes. The baseline of SIGILL crashes on all other phones was not noticeably affected.
Reporter
Comment 21•6 years ago
(Edit: Merge-conflict..)
Reporter
Comment 22•6 years ago
We are reaching out to ARM for ideas here. We've also been comparing our implementation with v8 for any hints. I have two Xperia Z5s to test with, but no specific STR beyond "browse the web for a few hours".
These crashes are limited to specific older (~2015 era) phones. While impact is high for those users (daily crashes), it seems that it has been that way for several years. So far there is no evidence that this issue is a problem for any newer/upcoming phones. It would still be good to understand this issue, but its impact will also decrease over time.
Comment 23•6 years ago
Bug 1575153 is simplifying our icache flushing code a lot. I don't think it will fix or affect this, but mentioning this here in case we notice any changes.
Comment 24•6 years ago
Ted - curious if we ever got any response from ARM. Although these may be older devices, the sheer volume of just one of these signatures is pretty massive - 198919 crashes over 6 months. In one week SprintfLiteral has 9884 crashes on the current Fennec release 68.2.0.
Reporter
Comment 25•6 years ago
We discussed things with an ARM engineer and we aren't doing anything obviously wrong. I've been trying to repro on a pair of Xperia Z5c devices but am not having much success. In the past I experienced these crashes during general web browsing over the course of a week, which isn't easy to replicate in a targeted way.
A next step might be to put together a build with a testing function (in C++) to stress test the code linking more directly. This might take a bit more time to get something useful.
Comment 26•5 years ago
I still see some of these signatures showing up with Fenix builds, so Fennec's upcoming EOL isn't going to save us here. Is there anything else we can do to try to get some traction on this bug?
Reporter
Comment 27•5 years ago
Looking at the crash data from a few different angles to refresh my understanding:
*Only focusing on ARM64 Fennec + Fenix crashes in last 14 day window.
- All crashes: 73% Fennec / 27% Fenix
- All crashes: 16% SIGILL
- All crashes: 15% SprintfLiteral<T>
- SIGILL crashes: 99% Fennec / 1% Fenix
- SIGILL crashes: 80% SprintfLiteral<T>
- SIGILL crashes: 50% msm8952 / 20% msm8994
- MSM8952/MSM8994 crashes: 99% Fennec / 1% Fenix
- MSM8952/MSM8994 Fenix crashes: 10% SIGILL
- Non-MSM8952/MSM8994 Fenix crashes: 0% SIGILL
- Fenix crashes: 4% MSM8952/MSM8994
Summary:
- On Fennec, this bug is 80% of MSM8952/MSM8994 crashes and 15% of all Fennec crashes
- On Fenix, this bug is 10% of MSM8952/MSM8994 crashes and 0.4% of all Fenix crashes
Note 1: SIGILL crashes are the primary indicator for this bug.
Note 2: SprintfLiteral<T> is how this crash is often reported when the stack-walker makes a guess about the signature. In reality the crash is in dynamically generated JIT code.
Note 3: MSM8952 / MSM8994 are Qualcomm SoCs from ~2015 that were used in Sony Xperia Z5 / Xperia X phones as well as others.
Reporter
Comment 28•5 years ago
This issue seems to still track the same set of devices as before. The situation is dramatically better for Fenix. Users still running on these devices see a 1.1x crash rate on Fenix, while on Fennec it was 5x.
We still don't have any great paths to debug this further, so this issue is probably still stalled. It would still be a large time commitment to make further progress on this bug.
Comment 29•5 years ago
The crash is also possible on Windows 8.1 (64-bit) with Thunderbird release 78.3.1 (32-bit).
It happened when a PDF attachment was opened.
crash bp-ec837137-c4d8-48a7-b20b-034570201005
Thunderbird 78.3.1 Crash Report [@ js::jit::MaybeEnterJit ]
top 10 entries of stack trace
0 @0x2ee84092 context
1 @0x2ee20de8 frame_pointer
2 @0x2a7c944f frame_pointer
3 @0x2edb08aa frame_pointer
4 xul.dll js::jit::MaybeEnterJit(JSContext*, js::RunState&) js/src/jit/Jit.cpp:196 frame_pointer
5 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp:450 cfi
6 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) js/src/vm/Interpreter.cpp:620 cfi
7 xul.dll js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) js/src/vm/Interpreter.cpp:665 cfi
8 xul.dll js::CallSetter(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::Handle<JS::Value>) js/src/vm/Interpreter.cpp:803 cfi
9 xul.dll SetExistingProperty(JSContext*, JS::Handle<JS::PropertyKey>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::Handle<js::NativeObject*>, JS::Handle<JS::PropertyResult>, JS::ObjectOpResult&) js/src/vm/NativeObject.cpp:2809 cfi
10 xul.dll js::NativeSetProperty<js::Qualified>(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<JS::PropertyKey>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::ObjectOpResult&) js/src/vm/NativeObject.cpp:2838 cfi
Reporter
Comment 31•4 years ago
This crash seems to have grown again with a few new signatures. Nearly all the excess crashes are Linux 3.10.84 kernels with big.LITTLE Qualcomm CPU configurations.
Also in the mix now is the Samsung Galaxy Tab S2 (with board == msm8976), also from the ~2015 era.
Comment 32•3 years ago
The bug is linked to topcrash signatures, which match the following criteria:
- Top 20 desktop browser crashes on release (startup)
- Top 10 content process crashes on release
- Top 20 desktop browser crashes on beta (startup)
- Top 10 content process crashes on beta
- Top 5 desktop browser crashes on Linux on release (startup)
- Top 10 AArch64 and ARM crashes on release (startup)
- Top 10 AArch64 and ARM crashes on nightly
For more information, please visit auto_nag documentation.
Comment 33•3 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 34•3 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 35•3 years ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 content process crashes on beta
For more information, please visit auto_nag documentation.
Comment 36•3 years ago
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Reporter
Comment 37•2 years ago
Updated summary:
- Affected "android board" devices: msm8952, msm8976, universal7420, msm8994
- These chipsets are still all from 2015
- There are about 140 crashes per day (in crash stats) for these issues
- These users experience 40% more crashes than a Fenix user on a different device due to this issue
- Past efforts to target this sort of hardware/kernel issue have struggled to generate meaningful results
- Fenix continues to be less impacted than Fennec, but reason is unclear
- The 'crash data' is showing all JIT crashes for all H/W and all reasons, so it is much larger than the specific issue this bug is tracking
Dropping severity to S3 since this is simply a slightly elevated level of random crashes and reloading tabs will get people back on track. Also marking stalled since we don't have any new leads on chasing these sorts of random hardware issues.