Closed Bug 1526396 Opened 10 months ago Closed 4 months ago

Crash in [@ __clear_cache]

Categories

(Firefox for Android :: General, defect, P1, critical)

Unspecified
Android
defect

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox65 --- wontfix
firefox66 --- fix-optional
firefox67 --- fix-optional

People

(Reporter: marcia, Unassigned)

Details

(Keywords: crash, regression, regressionwindow-wanted, Whiteboard: [geckoview:p1])

Crash Data

This bug is for crash report bp-219c0ca6-89c9-4211-9172-787560190208.

Crash seen as far back as 64: https://bit.ly/2MWPuHR. Filing here since I am not sure whether or not it belongs in the JS component. Two years ago we had a similar stack in Bug 1354882, during the 55 cycle, which came and went.

Some comments:
*Crashes several times a day on Amazon Fire tab
*I put the phone down to let dogs out, came back, crashed

Does someone in QA have an Amazon Fire tablet to test with?

Top 10 frames of crashing thread:

0 libc.so libc.so@0x36c8d 
1 libxul.so __clear_cache 
2 libxul.so js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:442
3 libxul.so JSString* js::AllocateString<JSString,  js/src/gc/Nursery.cpp:317
4 libxul.so JSString* js::ConcatStrings< js/src/gc/Allocator.h:43
5 libxul.so js::jit::RematerializedFrame::isFunctionFrame const mfbt/Span.h
6 libxul.so __clear_cache 
7 libxul.so Interpret js/src/vm/Interpreter.cpp:593
8 libxul.so __clear_cache 
9 libxul.so __clear_cache 

Flags: needinfo?(ioana.chiorean)

This is a pretty high crash volume for beta. Sorina, do you have a Fire tablet?

Flags: needinfo?(sorina.florean)
Priority: -- → P1

Fire tablets are unsupported. Every user of these sideloads Firefox and does not get updates.

We do not have one, neither in the SV inventory (for Mozilla or for other projects) nor in Mozilla's own inventory.
Also, as Kevin said, the device is not supported (at least on our side).

Flags: needinfo?(sorina.florean)
Flags: needinfo?(ioana.chiorean)

Sony Xperia Z5 is the next highest crashing device on one signature, and there are other devices crashing as well. I will try to see if there are other devices we have in our inventory that are in the list.

Mira, Andrei can you have a look at this pls?

Flags: needinfo?(mirabela.lobontiu)
Flags: needinfo?(andrei.bodea)

Hi,
I tested with Sony Xperia Z5 Premium (Android 7.1.1) on Nightly 67.0a1 (2019-02-15), but nothing crashed.
I opened the links suggested in Comment 5, plus other pages, but everything worked as expected.
Thank you!

Flags: needinfo?(mirabela.lobontiu)

Hello,
Unfortunately I was not able to reproduce this crash.
I tried to reproduce it with the websites from Comment 5 and many other websites, but without success.
Devices: Sony Xperia Z5 (Android 7.0), Google Pixel 3 XL (Android P)
I will keep an eye out for this issue during my tests and use the Sony Xperia Z5 more; if I manage to reproduce it, I will post a comment with the additional information.

Flags: needinfo?(andrei.bodea)

#4 overall in the current 67 beta.

Some of the crash stacks have c++ so this may affect geckoview as well.

Whiteboard: [geckoview]

__clear_cache is a clang intrinsic function to invalidate the instruction cache. The crash stacks are pretty random.

Mike, could these Android __clear_cache crashes be related to a clang compiler update or enabling LTO?

There was a spike during Fennec 64 Nightly (but none in 64 Beta or Release) and then it returned in 66 Nightly and Beta. Looking at Android LTO bug 1480006, we tried to enable LTO in 64 Nightly (early September). We then turned LTO off until 66 Nightly (mid-January). Those LTO events roughly align with the crash graph at the top of this bug.

Crash Signature: [@ __clear_cache] [@ @0x0 | __clear_cache] → [@ __clear_cache] [@ @0x0 | __clear_cache] [@ _JNIEnv::CallVoidMethod | __clear_cache]
Flags: needinfo?(mh+mozilla)
See Also: → android-lto
Whiteboard: [geckoview] → [geckoview:p1]

From looking at a few crashes, what they all have in common is that __clear_cache was found by stack scanning. To me, that smells like __clear_cache is a red herring: its address just happens to be left over on the stack from a call made before executing JIT code. Many crashes look like they are in JIT code. Others happen in system libraries that we don't have symbols for. They also have a vast disparity of causes (segfaults, invalid instructions). Others just happen to have __clear_cache deep in the stack frames (https://crash-stats.mozilla.com/report/index/f0412b73-5237-4db2-8e90-aca370190228), most of which were found from stack scanning anyways.

It seems to me these are all very different crashes that end up bucketed together because they share some unrelated thing that appears through stack scanning. And yes, compiler optimizations could be responsible for making those calls apparent in stack traces, but that doesn't necessarily mean the compiler optimizations broke anything.

I'm not sure what we can do here... maybe hiding __clear_cache as it's very unlikely to be involved, would help bucket the crashes differently.

It's also worth noting that despite having frames in libxul.so, the frames that follow are still not from CFI, which is a relatively good indicator that the stack frames can't be trusted for anything. So even hiding __clear_cache would not necessarily bucket them any better.
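
To illustrate why a frame found by stack scanning can be a red herring, here is a minimal Python sketch. It is not the actual stack walker used by the crash reporter; the symbol table, addresses, and helper names (`SYMBOLS`, `symbolize`, `scan_stack`) are all hypothetical. It assumes that scanning treats every stack word that falls inside a known function as a frame, so a stale return address left by an earlier, already finished call shows up as if it were still on the call chain.

```python
from typing import List, NamedTuple, Optional


class Symbol(NamedTuple):
    name: str
    start: int  # first address of the function (inclusive)
    end: int    # one past the last address (exclusive)


# Hypothetical symbol table; the address ranges are made up for illustration.
SYMBOLS = [
    Symbol("__clear_cache", 0x1000, 0x1040),
    Symbol("Interpret", 0x2000, 0x3000),
]


def symbolize(address: int) -> Optional[str]:
    """Return the function containing `address`, or None if it is not code."""
    for sym in SYMBOLS:
        if sym.start <= address < sym.end:
            return sym.name
    return None


def scan_stack(stack_words: List[int]) -> List[str]:
    """Report every stack word that looks like a code address as a frame.

    No unwind information is used, so the scanner cannot tell a live
    return address from a stale one left over in stack memory.
    """
    frames = []
    for word in stack_words:
        name = symbolize(word)
        if name is not None:
            frames.append(name)
    return frames


# Simulated stack memory: a stale __clear_cache return address, plain data,
# a return address into Interpret, and another stale leftover.
stack = [0x1010, 0xDEADBEEF, 0x00000007, 0x2ABC, 0x1020]
print(scan_stack(stack))
```

In this toy model `__clear_cache` is reported twice even though nothing in the simulated stack says it is still executing, which mirrors the repeated `__clear_cache` entries in the crashing thread listed at the top of this bug.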

Flags: needinfo?(mh+mozilla)

Per Comment 12, I filed Bug 1541090 to add that signature to the skip list.
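
As a rough illustration of what adding a frame to a skip list does to bucketing, the sketch below is a simplified stand-in, not the real signature generation code. The function `make_signature`, the `SKIP_LIST` contents, and the frame orderings are assumptions made up for the example (only the frame names are borrowed from this bug). It assumes the signature is simply the first frame that is not on the skip list.

```python
from typing import FrozenSet, List

# Hypothetical skip list containing the frame discussed in this bug.
SKIP_LIST: FrozenSet[str] = frozenset({"__clear_cache"})


def make_signature(frames: List[str], skip_list: FrozenSet[str] = frozenset()) -> str:
    """Return the first frame not on the skip list, or "EMPTY" if none remain."""
    for frame in frames:
        if frame not in skip_list:
            return frame
    return "EMPTY"


# Two illustrative crashes that share only the leading __clear_cache frame.
crash_a = ["__clear_cache", "js::InternalCallOrConstruct", "Interpret"]
crash_b = ["__clear_cache", "_JNIEnv::CallVoidMethod"]

for crash in (crash_a, crash_b):
    before = make_signature(crash)
    after = make_signature(crash, SKIP_LIST)
    print(f"{before} -> {after}")
```

Without the skip list both crashes land in the same `__clear_cache` bucket; with it they split by their next frame. As Comment 12 points out, those next frames may themselves come from stack scanning, so the new buckets are not guaranteed to be more meaningful.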

fix-optional for 67 as it seems that only 66 is affected.

Closing this one out as WFM since there are no recent crashes in any of these signatures (perhaps due to the skip list addition in Comment 13).

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → WORKSFORME