Open Bug 1461724 Opened 7 years ago Updated 2 years ago

[ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Xperia X/Xperia Z5/Galaxy S6/Galaxy S2 Tab)

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

Unspecified
Android
defect

Tracking


Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox-esr102 --- affected
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox67.0.1 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix
firefox76 --- wontfix
firefox77 --- wontfix
firefox78 --- wontfix
firefox111 --- affected
firefox112 --- affected
firefox113 --- affected

People

(Reporter: tcampbell, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, stalled)

Crash Data

Attachments

(2 files, 1 obsolete file)

https://crash-stats.mozilla.org/signature/?cpu_arch=arm&signature=js%3A%3Ajit%3A%3AMaybeEnterJit&date=%3E%3D2018-02-15T11%3A24%3A52.000Z&date=%3C2018-05-15T12%3A24%3A52.000Z#graphs

If you graph by |android manufacturer|, Sony phones have a significantly higher crash rate. Half of the crashes are SIGILL and half are SIGSEG. They may be using an unusual kernel configuration.
Lars, did you mention there is some funniness about ARM alignment rules depending on the OS?
Flags: needinfo?(lhansen)
Crash Signature: [@ js::jit::MaybeEnterJit]
Oh, way cool. Will try to dig a little deeper. Although, for alignment we really should expect SIGBUS. Alignment discussion is on bug 1447577.
Generally I advise plotting this over six months (as the graph above shows). Then you start to see the Sony curve bend sharply upward around late January this year. But almost no matter what you look at, the graphs suddenly turn upward at that time, so the most plausible explanation is a general and fairly dramatic increase in crashes with Fennec 58, which released on Jan 23.

The correlation with Sony phones is undeniable, but it's hard to filter for that without knowing the user population by phone model. (If we assume that the uptick coincides with the release of a new phone model, it would be the Xperia L2: https://www.gsmarena.com/sony_xperia_l2-8987.php. It's listed with a quad-core Cortex-A53, i.e. an ARMv8, part of a MediaTek MT6737 SoC, which is popular in phones you've never heard of: https://www.kimovil.com/en/list-smartphones-by-processor/mediatek-mt6737.)

If you want to drive yourself crazy, try plotting by CPU count: 4- and 6-core devices crash like crazy, 8- and 10-core devices have few crashes, and 1-, 2-, and 3-core devices have virtually none. But 4 and 6 cores are typical on phones, the other CPU counts are outliers, and that probably explains that phenomenon. This doesn't even point to MT problems per se, just to the kinds of phones people have.
Ehrm... Even if you extend the graph timeline, you can see how 100% of the "above baseline" crashes came from msm8952 and msm8994 boards. According to the model numbers, those in turn correspond to the Xperia Z5 (Compact) and Xperia X (Compact).
The signature used to be |EnterBaseline| and I see Sony stand out there too. I'm not sure anything has changed this year and this seems like an older issue. Aggregating on |android model|, I see the following top models and chipsets:

> F5321 Sony Xperia X Compact (Qualcomm MSM8956)
> E5823 Sony Xperia Z5 Compact (Qualcomm MSM8994)
> F5121 Sony Xperia X (Qualcomm MSM8956)
> E6653 Sony Xperia Z5 (Qualcomm MSM8994)
> Redmi Note 3 (Qualcomm MSM8956)
> SO-02J Sony Xperia X Compact (JP) (Qualcomm MSM8956)
> KFSUWI Amazon Fire HD 10 (MediaTek)
> Nexus 5X (Qualcomm MSM8992)
> Redmi Note 4 (Qualcomm MSM8953)
> E6853 Sony Xperia Z5 Premium (Qualcomm MSM8994)
Flags: needinfo?(lhansen)
This is the #5 top browser crash on 60.0.2 at the moment.
Keywords: topcrash
Severity: major → critical
See Also: → 1550525
Summary: [ARM] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Sony phones) → [ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Sony phones)

Kannan did some analysis of some arm64 crashdumps in https://bugzilla.mozilla.org/show_bug.cgi?id=1550525#c16

Crash Signature: [@ js::jit::MaybeEnterJit] → [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache]

This issue is showing up under a few other signatures due to how the dumps are processed. We are seeing this high Sony <-> SIGILL correlation on both 32-bit and 64-bit ARM.

The most commonly affected boards (which are primarily, but not exclusively used by Sony) are MSM8952/MSM8956/MSM8994.
Note: Crashes for model F5321 list the board as MSM8952, while the product info I can find suggests it should be MSM8956 (hexa-core).

These Qualcomm Snapdragon parts use big.LITTLE configurations of 2+4 or 4+4 cores. The little cores are Cortex-A53 cores and the big cores are either Cortex-A53 or A72.

In Bug 1521158, we tried to address some cache invalidation issues, but it doesn't seem to have fixed everything.

Crash Signature: [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] → [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>]

This continues to be the #2 top crash in Fennec 68 release (js::jit::MaybeEnterJit signature). Besides some Sony devices, Samsung Galaxy Tab S2 is one of the top crashing devices.

I think we should revisit this bug to see if we can do anything to help the crashes. Would it help to do some outreach to Sony?

The SprintfLiteral signature alone has now exceeded 42,502 crashes on release. One of the JIT signatures has over 16K crashes in the 68 release. Considering we only collect a percentage of crashes, I feel this is quite high.

The feeling is this is still some complex CPU interaction. I'm doing some processing of crashes today to try and identify any particular patterns.

As of today the crashes are 95+% ARM64, which does not match my recollection of what it used to be. I'm attributing this to us now shipping arm64 fennec binaries on the play store and people automatically being converted over.

One guess we could try is to apply the existing Samsung Galaxy S6 workaround to these devices. That involved flushing the icache twice and did improve crash rate for those devices. This would be more of a shot in the dark than anything else. https://searchfox.org/mozilla-central/rev/9775cca0a10a9b5c5f4e15c8f7b3eff5bf91bbd0/js/src/jit/arm/Architecture-arm.cpp#256-262

Attached file Faulting instructions

Here are the faulting instructions for a sample of 100 FennecAndroid SIGILL crashes from last week. Overall, the majority of the code looks well formed and similar to the type of code we generate. This instruction data is what the minidump captures after the crash, and it strongly suggests an icache synchronization issue, since the CPU disagrees with memory.

Samples such as |cmp w2, #0x3e8| indicate that at least some of these crashes are in Baseline mainline code (the warm-to-Ion check), which is our simplest code-generation process and is single-threaded.

Of that sample set, 90+% of crashes are big/little configurations with either 6 (2+4) or 8 (4+4) cores.

Top crashing chipsets are:
MSM8956 (6-core, Qualcomm, used in Sony devices and others)
MSM8994 (8-core, Qualcomm, used in Sony devices and others)
Exynos7420 (Samsung; previously a big source of icache-related crashes. We now double-flush, and it helped with the numbers)

If I look at non-SIGILL crashes, I don't see any of these in even the top-15 crashing chipsets. https://crash-stats.mozilla.com/search/?reason=%21%3DSIGILL%20%2F%20ILL_ILLOPC&product=FennecAndroid&version=68.0&date=%3E%3D2019-08-05T15%3A32%3A00.000Z&date=%3C2019-08-12T15%3A32%3A00.000Z&_facets=android_board&_sort=-date#facet-android_board

(Adding signature that some of these crashes manifested as in 67)

Crash Signature: [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] → [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ]
See Also: → 1539465
Attached image FennecAndroid SIGILL Per Day (obsolete) —

Querying the crashstats API for all FennecAndroid SIGILL crashes -- to avoid concerns about bad signatures -- shows that the crash rate doubled for FF68. This increase in crash rate is due to these same unstable phones migrating from ARM32 to ARM64 builds as we officially released ARM64 FF68 to the Play Store.

Depends on: 1573215

Adding [fennec68?] whiteboard tag so we track this top crash for Fennec ESR 68.

OS: Unspecified → Android
Whiteboard: [fennec68?]
Crash Signature: [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ] → [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ] [@ arena_malloc | dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-large object space allocation (deleted)@0x0] [@ dalvik-main…

Updated graph that breaks down the different chipsets we are interested in. This is a stacked graph. With the release of FF68 (and the switch to ARM64 builds), there was a slight jump in Exynos7420 (Galaxy S6) crashes and a huge jump in MSM8994/MSM8956 (Xperia Z5, Xperia X) crashes. The baseline of SIGILL crashes on all other phones was not noticeably affected.

Attachment #9084769 - Attachment is obsolete: true

(Edit: Merge-conflict..)

Crash Signature: [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ] [@ arena_malloc | dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-large object space allocation (deleted)@0x0] [@ dalvik-main… → [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ] [@ arena_malloc | dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-large object space allocation (deleted)@0x0] [@ dalvik-m…
Summary: [ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Sony phones) → [ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Xperia X/Xperia Z5/Galaxy S6)

We are reaching out to ARM for ideas here. We've also been comparing our implementation with v8 for any hints. I have two Xperia Z5s to test with, but no specific STR beyond "browse the web for a few hours".

These crashes are limited to specific older (~2015 era) phones. While the impact is high for those users (daily crashes), it seems it has been that way for several years. So far there is no evidence that this issue affects any newer or upcoming phones. It would still be good to understand this issue, but its impact will also decrease over time.

Bug 1575153 is simplifying our icache flushing code a lot. I don't think it will fix or affect this, but mentioning this here in case we notice any changes.

Ted - curious if we ever got any response from ARM. Although these may be older devices, the sheer volume of just one of these signatures is pretty massive: 198,919 crashes over 6 months. In one week, SprintfLiteral has 9,884 crashes on the current Fennec release, 68.2.0.

Flags: needinfo?(tcampbell)

We discussed things with an ARM engineer and we aren't doing anything obviously wrong. I've been trying to repro on a pair of Xperia Z5c devices but am not having much success. In the past I experienced these crashes during general web browsing over the course of a week, which isn't easy to replicate in a targeted way.

A next step might be to put together a build with a testing function (in C++) to stress-test the code linking more directly. This might take a bit more time to get something useful.

Flags: needinfo?(tcampbell)

I still see some of these signatures showing up with Fenix builds, so Fennec's upcoming EOL isn't going to save us here. Is there anything else we can do to try to get some traction on this bug?

Looking at the crash data from a few different angles to refresh my understanding:

*Only focusing on ARM64 Fennec + Fenix crashes in last 14 day window.

  • All crashes: 73% Fennec / 27% Fenix
  • All crashes: 16% SIGILL
  • All crashes: 15% SprintfLiteral<T>
  • SIGILL crashes: 99% Fennec / 1% Fenix
  • SIGILL crashes: 80% SprintfLiteral<T>
  • SIGILL crashes: 50% msm8952 / 20% msm8994
  • MSM8952/MSM8994 crashes: 99% Fennec / 1% Fenix
  • MSM8952/MSM8994 Fenix crashes: 10% SIGILL
  • Non-MSM8952/MSM8994 Fenix crashes: 0% SIGILL
  • Fenix crashes: 4% MSM8952/MSM8994

Summary:

  • On Fennec, this bug is 80% of MSM8952/MSM8994 crashes and 15% of all Fennec crashes
  • On Fenix, this bug is 10% of MSM8952/MSM8994 crashes and 0.4% of all Fenix crashes

Note 1: SIGILL crashes are the primary indicator for this bug.
Note 2: SprintfLiteral<T> is how this crash is often reported when the stack-walker guesses at a signature. In reality the crash is in dynamically generated JIT code.
Note 3: MSM8952 / MSM8994 are Qualcomm SoCs from ~2015 that were used in Sony Xperia Z5 / Xperia X phones as well as others.

This issue seems to still track the same set of devices as before. The situation is dramatically better for Fenix: users still on these devices see a 1.1x crash rate on Fenix, while on Fennec it was 5x.

We still don't have any great paths to debug this further, so this issue is probably still stalled. It would still be a large time commitment to make further progress on this bug.

Flags: needinfo?(tcampbell)

The crash is also possible on Windows 8.1 (64-bit) with the Thunderbird 78.3.1 release (32-bit).
It happened when a PDF attachment was opened.

crash bp-ec837137-c4d8-48a7-b20b-034570201005

Thunderbird 78.3.1 Crash Report [@ js::jit::MaybeEnterJit ]

top 10 entries of stack trace

0 @0x2ee84092 context
1 @0x2ee20de8 frame_pointer
2 @0x2a7c944f frame_pointer
3 @0x2edb08aa frame_pointer
4 xul.dll js::jit::MaybeEnterJit(JSContext*, js::RunState&) js/src/jit/Jit.cpp:196 frame_pointer
5 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp:450 cfi
6 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) js/src/vm/Interpreter.cpp:620 cfi
7 xul.dll js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) js/src/vm/Interpreter.cpp:665 cfi
8 xul.dll js::CallSetter(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::Handle<JS::Value>) js/src/vm/Interpreter.cpp:803 cfi
9 xul.dll SetExistingProperty(JSContext*, JS::Handle<JS::PropertyKey>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::Handle<js::NativeObject*>, JS::Handle<JS::PropertyResult>, JS::ObjectOpResult&) js/src/vm/NativeObject.cpp:2809 cfi
10 xul.dll js::NativeSetProperty<js::Qualified>(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<JS::PropertyKey>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::ObjectOpResult&) js/src/vm/NativeObject.cpp:2838 cfi

Crash Signature: [@ js::jit::MaybeEnterJit] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ] [@ arena_malloc | dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-large object space allocation (deleted)@0x0] [@ dalvik-m… → [@ js::jit::MaybeEnterJit] [@ Interpret] [@ js::InternalCallOrConstruct] [@ js::LiveSavedFrameCache::~LiveSavedFrameCache] [@ SprintfLiteral<T>] [@ mdb_env_cthr_toggle ] [@ arena_malloc | dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-large obj…

This crash seems to have grown again under a few new signatures. Nearly all of the excess crashes are on Linux 3.10.84 kernels with big.LITTLE Qualcomm CPU configurations.

Also in the mix now is the Samsung Galaxy Tab S2 (with board == msm8976), also from the ~2015 era.

Summary: [ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Xperia X/Xperia Z5/Galaxy S6) → [ARM/ARM64] Crash in js::jit::MaybeEnterJit (SIGILL/SIGSEG on Xperia X/Xperia Z5/Galaxy S6/Galaxy S2 Tab)

The bug is linked to topcrash signatures, which match the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on release
  • Top 20 desktop browser crashes on beta (startup)
  • Top 10 content process crashes on beta
  • Top 5 desktop browser crashes on Linux on release (startup)
  • Top 10 AArch64 and ARM crashes on release (startup)
  • Top 10 AArch64 and ARM crashes on nightly

For more information, please visit auto_nag documentation.

Severity: critical → S2

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

For more information, please visit auto_nag documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Updated summary:

  • Affected "android board" devices: msm8952, msm8976, universal7420, msm8994
  • These chipsets are still all from 2015
  • There are about 140 crashes per day (in crash stats) for these issues
  • These users experience 40% more crashes due to this issue than a Fenix user on a different device
  • Past efforts to target this sort of hardware/kernel issue have struggled to generate meaningful results
  • Fenix continues to be less impacted than Fennec, but the reason is unclear
  • The 'crash data' field shows all JIT crashes for all hardware and all reasons, so it is much larger than the specific issue this bug is tracking

Dropping severity to S3 since this is simply a slightly elevated level of random crashes and reloading tabs will get people back on track. Also marking stalled since we don't have any new leads on chasing these sorts of random hardware issues.

Severity: S2 → S3
Keywords: stalled
Priority: P2 → P3
Whiteboard: [fennec68?]