Closed Bug 1286307 Opened 9 years ago Closed 6 years ago

Crash in dalvik-main space 1 (deleted)@0xc7ffffe

Categories

(Core :: JavaScript Engine: JIT, defect, P1)

47 Branch
Unspecified
Android
defect

Tracking


RESOLVED DUPLICATE of bug 1461724
mozilla54
Tracking Status
firefox47 --- wontfix
firefox48 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- affected

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, topcrash, Whiteboard: [#jsapi:crashes-retriage])

Crash Data

Attachments

(5 files, 3 obsolete files)

This bug was filed from the Socorro interface and is report bp-a652d2c4-1678-4eaf-bfda-c62772160712.
=============================================================

Seen while reviewing crash stats. On Firefox 47 this is currently a top crash with over 8K crashes. Looks as if the crash is also present on Firefox 48b6.
Jim, do you know anything about this?
Flags: needinfo?(nchen)
Link to crashes on 47: http://bit.ly/29v34OO.
The second frame is EnterBaseline, so this seems like a variant of https://crash-stats.mozilla.com/signature/?signature=EnterBaseline - should we link it to the meta bug 858032? We also fixed bug 1247312 in 48, which should cause the EnterBaseline numbers to drop.
Yeah seems like a JIT crash. "dalvik-main space 1" is a bogus frame in the stack that Breakpad picked up, and it's right below the top frame with a hex address, which makes sense for JIT code. Because we recently started ignoring hex-address top frames in Socorro, "dalvik-main space 1" became the new top frame and crash signature.
Flags: needinfo?(nchen)
Naveed, this is the top crash for Fennec right now. Anything you can do to make it actionable?
Flags: needinfo?(nihsanullah)
Jan, can you please take a look at this trending crasher?
Flags: needinfo?(nihsanullah) → needinfo?(jdemooij)
Adding a similar stack signature - http://bit.ly/29IdP06 - which accounts for another 3K in crashes.
Crash Signature: [@ dalvik-main space 1 (deleted)@0xc7ffffe] → [@ dalvik-main space 1 (deleted)@0xc7ffffe] [@ dalvik-main space 1 (deleted)@0xbbffffe]
Assignee: nobody → efaustbmo
This signature has over 20,403 crashes on 50 in the last 7 days; it seems to have spiked a bit.
"dalvik-main space 1 (deleted)@0xbbffffe" has ~20000 reports over the last week. It seems highly correlated with a specific graphic card: (99.49% in signature vs 26.82% overall) adapter_driver_version = OpenGL ES 3.1 v1.r7p0-03rel0.b596bd02e7d0169c10574b57180c8b57 (99.49% in signature vs 28.33% overall) adapter_device_id = Mali-T760 (99.49% in signature vs 45.43% overall) adapter_vendor_id = ARM (56.76% in signature vs 16.46% overall) CPU Info = ARMv1 ARM part(0x4100d070) features: half,thumb,fastmult,vfpv2,edsp,neon,vfpv3,vfpv4,idiva,idivt (41.55% in signature vs 13.94% overall) CPU Info = ARMv1 ARM part(0x4100d030) features: half,thumb,fastmult,vfpv2,edsp,neon,vfpv3,vfpv4,idiva,idivt
Eric, any luck with this JS crash? As a more drastic measure, we might block SM-G92[XX] phones in the Play Store.
Flags: needinfo?(efaustbmo)
Naveed, this one looks like it needs more attention, and probably another owner.
Flags: needinfo?(nihsanullah)
Hannes please look into this bug. The volume is spiking. Thanks
Assignee: efaustbmo → hv1989
Flags: needinfo?(nihsanullah) → needinfo?(hv1989)
Note that reproducing this requires specific hardware. Below are the top 5 device model numbers. If you are ordering a phone for this bug, please reference the model number, as Samsung ships two different SOC boards depending on your location: North America gets the Broadcom SOC, while the rest of the world gets Exynos SOC phones. Exynos is the SOC needed to reproduce this bug. The phone needs to be updateable to Android API 23, which is Android 6.0 (Marshmallow).

Mfg.     Model             And. API  CPU ABI      #     %
samsung  SM-G920F          23 (REL)  armeabi-v7a  8785  37.9%
samsung  SM-G925F          23 (REL)  armeabi-v7a  5097  22.0%
samsung  SAMSUNG-SM-G920A  23 (REL)  armeabi-v7a  1396   6.0%
samsung  SM-G920V          23 (REL)  armeabi-v7a  1296   5.6%
samsung  SM-G920I          23 (REL)  armeabi-v7a  1014   4.4%
Kevin, do we know of anyone having success reproducing this locally?
I'm ordering one.
I hit this signature. I played videos linked from http://www.bbc.com/news/video_and_audio/headlines and it crashed right at the start of playback. bp-7377478a-5343-42ba-873e-0f0492161221 There was about 3 hours of browsing before I hit this.
In all generality, JIT crashes are very hard to investigate given only a stack trace. There is no information to look into, and the data in the core dumps often doesn't contain much to go on either. We have been discussing internally how we could improve this information, but so far we have only found ways to classify these crashes better, not to make them more actionable.

That is also the reason we rely heavily on fuzzers for finding such issues, and we are actively coordinating with the fuzzing team to improve them, e.g. investigating why fuzzers didn't find particular crashes and improving the instrumentation used for fuzzing. That has worked quite well, but we keep trying to improve the situation. Fuzzers are not the holy grail or the only source for finding/fixing bugs. Reproducible crashes are actionable for us: that way we can interact with the failure and try to piece together what went wrong.

@kbrosnan: You have such a machine, right? Could you try to run a debug build and try to hit this? Note: JS might be very slow in debug builds. Let us know if that makes it impossible to reproduce or use; we could try to decrease the amount of instrumentation if needed.
Flags: needinfo?(kbrosnan)
Flags: needinfo?(jdemooij)
Flags: needinfo?(hv1989)
Flags: needinfo?(efaustbmo)
Hi Kevin, I also have a Samsung phone (S6 Edge, latest patches) and ran into this crash when I was trying to delete a typo in the URL bar (I had forgotten a space in the search), but I couldn't reproduce it so far. This crash might really be a tricky one with lots of possible steps to crash; just reproducing it seems to be the problem.
(In reply to Hannes Verschore [:h4writer] from comment #17)
> @kbrosnan: You have such a machine right? Could you try to run a debug build
> and try to hit this? Note: JS might be very slow in debug builds. Let us
> know if that is making it impossible to reproduce or use. We could try to
> decrease the amount of instrumentation if needed.

It seems that people are able to hit it in a normal (release) browser, though nobody has been able to reproduce it in a debug browser. I'm going to order a device to try that myself and have the hardware next to me while debugging. I hope that makes this actionable.
I have spent time trying to reproduce one of the S6 crashes using a local debug build with JimDB attached. So far I have not had any luck.

Whatever caused this was pushed as part of 50.1.0. In 50.0.2 there are ~350 crashes with this signature; in 50.1.0 there are 22,000. https://crash-stats.mozilla.com/signature/?product=FennecAndroid&signature=dalvik-main%20space%201%20%28deleted%29%400xbbffffe&date=%3E%3D2016-10-10T23%3A11%3A21.000Z&date=%3C2017-01-10T23%3A11%3A21.000Z

The set of changesets is https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=FENNEC_50_0_2_RELEASE&tochange=FENNEC_50_1_0_RELEASE; it looks to be 30 to 50 changes.
Flags: needinfo?(kbrosnan)
Tomcat's girlfriend can reproduce this (4 times in a week); he himself couldn't. One difference seemed to be that she had myKnox installed. She is now trying with that disabled to see if she still sees crashes. We should know more by the end of this week or next week, hopefully.

(In reply to Kevin Brosnan [:kbrosnan] from comment #20)
> The set of changesets is
> https://hg.mozilla.org/releases/mozilla-release/
> pushloghtml?fromchange=FENNEC_50_0_2_RELEASE&tochange=FENNEC_50_1_0_RELEASE
> it looks to be 30 to 50 changes.

@Jan: Can you have a quick look through the changesets to see if something catches your eye that could potentially cause these JIT crashes?
Flags: needinfo?(jdemooij)
(In reply to Hannes Verschore [:h4writer] from comment #21)
> @Jan: Can you have a quick look through the changesets if something catches
> your eye that could potentially cause these jit crashes?

I don't see anything that looks related. There are some JS patches in there, but I don't see how they could cause JIT crashes, and only on ARM/Android.

Hannes, do these reports contain some code around the IP? If so, it would be useful to look at a number of them to see if there's a pattern.
Flags: needinfo?(jdemooij)
Couldn't reproduce this yet, though I got a lead that I'm investigating.
Component: General → JavaScript Engine: JIT
Priority: -- → P1
Product: Firefox for Android → Core
I've been looking into this for some serious time already. I tried to reproduce it on the phone I ordered, but didn't get it to fail yet, at least when using the regular (release) browser. I also tried a more targeted attack and compiled the JS engine myself. I know this mostly happens in Baseline code and I think switching between the cores might be related, so I constructed a payload that runs a benchmark which should exhibit this problem if it runs long enough. But I couldn't get it to reproduce. I'm starting to think this might be a newer revision of the chip? I asked Standard8 for his cpuinfo, since he crashes on it, to see if his chip is older. No answer yet.

I'm quite convinced this is a CPU bug (in the ARM big.LITTLE implementation of the A53 and A57). There are a lot of errata affecting AArch64 programs. One that would fit this issue quite nicely: http://www.mono-project.com/news/2016/09/12/arm64-icache/. Now, we are running AArch32, which as far as I know is not affected, or I'm misreading the source. Switching Firefox to ARM64 is definitely not a solution yet. Also, our logs show that we crash on any address, not specifically the top 64 bits, which is one of the specifics that report mentions.

I've looked at different dumps, lately mostly the SIGILL ones, since the logs are dominated by them. I can definitely see that this happens mostly in Baseline code, both main code and stubs, and always on valid code (ldr/str/mov/...), not specifically memory opcodes. I haven't been able to find a correlation between the logs yet.

I also skimmed the internet for specific AArch32 A53/A57 errata, but haven't found anything useful yet. I think people mostly run ARM64 on that architecture.
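[Editorial note: to make the icache hypothesis above concrete, here is a minimal, hedged sketch in C++ of the generic JIT pattern the erratum would bite, not SpiderMonkey's actual allocator code. EmitAndFlush and its use of mmap/__builtin___clear_cache are illustrative assumptions only.]

// The JIT writes machine code into an executable buffer, flushes the
// instruction cache, and jumps to it. If the flush fails to invalidate stale
// icache lines (e.g. after migrating between big.LITTLE cores), the CPU can
// execute whatever bytes were cached before, which surfaces as crashes at
// arbitrary addresses or SIGILL on otherwise valid code.
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

using JitEntry = int (*)();

JitEntry EmitAndFlush(const unsigned char* code, size_t size) {
    void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return nullptr;
    memcpy(buf, code, size);                       // new instructions reach the data cache
    char* begin = static_cast<char*>(buf);
    __builtin___clear_cache(begin, begin + size);  // must also invalidate the icache
    // If the invalidation silently fails, the call through the returned
    // pointer can still fetch stale instruction bytes for this range.
    return reinterpret_cast<JitEntry>(buf);
}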
Related to all my investigation: I'm able to trigger a crash on the Galaxy S6. I'm hoping it is this bug, but I cannot check. Every time I press that I want to send the crash information, but I cannot see the crash report in "about:crashes". Is there something I can do about that? It would be easier to fix if I could reproduce it, and it would be a shame if I'm already able to reproduce it but cannot see the reports. What can I do to see the crash report?
Flags: needinfo?(ted)
Fill out the email address section of the crash reporter. You can then log onto crash-stats.mozilla.org with that email account's Google account; either click the "Your profile" link in the lower left or visit https://crash-stats.mozilla.com/profile/

Otherwise, fill out the email section and people with raw-dump access can provide you with the crash IDs. Ping me or ask in #crashkill if you need this.

Leaving the Ted NI in case he has other ideas.
(In reply to Kevin Brosnan [:kbrosnan] from comment #26)
> Fill out the email address section of the crash reporter. You can then log
> onto crash-stats.mozilla.org with that email account's Google account.
> Either click the Your profile link in the lower left or visit
> https://crash-stats.mozilla.com/profile/
>
> Else fill out the email section and people with rawdump access can provide
> you the crash ids. Ping me or ask in #crashkill if you need this.
>
> Leaving the Ted NI in case he has other ideas.

Thanks for the tip. I did fill out the email address (hverschore@mozilla.com), but nothing shows up on https://crash-stats.mozilla.com/profile/ either.
Are you saying that you're submitting the crash report but it doesn't show up in about:crashes? If that's the case then something must be going wrong with the submission.

Note there's also bug 1319071, which makes unsubmitted crash reports not visible in about:crashes on Linux systems (including Android).

(In reply to Hannes Verschore [:h4writer] from comment #27)
> Thanks for the tip. I did fill out the email address:
> hverschore@mozilla.com. Nowhere visible on
> https://crash-stats.mozilla.com/profile/ either?

I searched for crash reports with that email and didn't find any, so something is clearly not working.
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #28)
> Are you saying that you're submitting the crash report but it doesn't show
> up in about:crashes? If that's the case then something must be going wrong
> with the submission.
>
> Note there's also bug 1319071, which makes unsubmitted crash reports not
> visible in about:crashes on linux systems (including Android).
>
> I searched for crash reports with that email and didn't find any, so
> something is clearly not working.

Yes, I submitted the crash reports, but I don't see them in about:crashes and can't find them online, which is quite annoying. What can I do to diagnose or fix this? (I hope I can still reproduce, since I've flashed an older ROM onto the device now.)
Can you capture logcat while reproducing the crash and submitting? If we're failing to submit there ought to be some information there.
I thought this might be caused by an older kernel and flashed my device. I only found out later that we record the build in the "app notes", and looking at them there doesn't seem to be a particular version that occurs more often, except for the latest builds. Seems like everybody nicely upgrades their phones. It does mean that my crash reports are now getting submitted.

I also found a way to reproduce. Every 24h I get 1-2 crash reports (very irregular). I looked at the 3 I have:
https://crash-stats.mozilla.com/report/index/20d1071c-9dc1-488a-ab3e-196772170126
https://crash-stats.mozilla.com/report/index/ddd1ae46-2856-427c-8473-274de2170126
https://crash-stats.mozilla.com/report/index/a080cc91-8a53-48c7-9222-9c5822170125

ddd1ae46-2856-427c-8473-274de2170126 is definitely the same issue we have here. The other two look similar and could have the same cause, but could also be something else. I will keep this running for the rest of the week; that way I'll have a better idea of how many crashes there are and what they look like.

Reading the errata for the A53 and the fixes others are applying (on AArch64), I have an idea of which workarounds I can try. Next week I'll compile a build with a workaround and search for one that works.
Crash Signature: [@ dalvik-main space 1 (deleted)@0xc7ffffe] [@ dalvik-main space 1 (deleted)@0xbbffffe] → [@ dalvik-main space 1 (deleted)@0xc7ffffe] [@ dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-main space 1 (deleted)@0xd2ffffe]
Crash Signature: [@ dalvik-main space 1 (deleted)@0xc7ffffe] [@ dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-main space 1 (deleted)@0xd2ffffe] → [@ dalvik-main space 1 (deleted)@0xc7ffffe] [@ dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-main space 1 (deleted)@0xd2ffffe] [@ arena_malloc | dalvik-main space 1 (deleted)@0xbbffffe] [@ dalvik-large object space allocation (deleted)@0x0] [@ d…
Yesterday I tried "workaround 1". Here I flush the cache twice, making sure the start address of the flush differs. It has been happily buzzing for almost 24h. No crash yet! Now it could be I only made it less likely to hit this bug. Though it would already be a 4x reduction, currently. Which would already be an improvement. I'll keep it buzzing for some more time and see. In the meantime I'll try to add some cpu detection to make sure we only do this workaround on the affected cpus.
I didn't expect my first workaround to be good immediately, but it has been running for 2 days without crashes. (Not saying that it fully fixes the issue, but it will at least decrease it 10x; crash-stats will probably give a better picture. I'll also keep the phone running.)

When I went through the errata there were some known bugs w.r.t. the big.LITTLE architecture and switching between cores: in some cases the instruction cache wouldn't get updated. There were also issues with kernel code (on arm64) that cached the cache line length. But I found none that match our case exactly. Some fixes I saw (on arm64) were to add NOPs and to increase the size of the region being cache-flushed. A few of the errata are arm64-specific, and I assume a few of the errata for arm32 have simply not been found yet.

Doing the cache flush twice works, and I deliberately made sure the starting address is different for both flushes. I think I read somewhere that the start address made a difference for some bugs. In our case that is also fine, since we put the pointer to the JitCode first. I'm very happy we don't have to add extra NOPs, since that would be awful for our stubs.
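[Editorial note: a hedged sketch of the double-flush workaround as described above, not the patch as it landed. The function names and the choice to shift the second flush by one pointer-sized word are illustrative assumptions; the comment above notes the buffer starts with the JitCode pointer rather than executable code, so skipping it is assumed safe.]

#include <cstddef>

// Flush one contiguous range of freshly written JIT code.
static inline void FlushRange(char* begin, size_t size) {
    __builtin___clear_cache(begin, begin + size);
}

// Workaround: flush twice, with a different start address the second time,
// to force the (buggy) icache invalidation through on the affected CPU.
void FlushWithExynos7420Workaround(char* begin, size_t size, bool needsDoubleFlush) {
    FlushRange(begin, size);
    if (needsDoubleFlush && size > sizeof(void*)) {
        // Second flush starts one word in, so the kernel sees a different
        // start address while still covering all of the executable bytes.
        FlushRange(begin + sizeof(void*), size - sizeof(void*));
    }
}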
Attachment #8832716 - Flags: review?(jdemooij)
Comment on attachment 8832716 [details] [diff] [review]
Workaround by doing two cacheflushes

Review of attachment 8832716 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for digging into this! It's definitely worth a try to see how it affects crash stats.

::: js/src/jit/ExecutableAllocator.cpp
@@ +327,5 @@
> +    // The exynos7420 cpu (EU galaxy S6/S7 (Note)) has a bug where sometimes
> +    // flushing doesn't invalidate the instruction cache. As a result we force
> +    // it by calling the cacheFlush twice on different start address.
> +    FILE* fp = fopen("/proc/cpuinfo", "r");
> +    if (fp) {

I'm removing initStatic in another (s-s) bug and we don't need this on non-Linux/ARM. What do you think about moving this into Architecture-arm.cpp, maybe in InitARMFlags where we have other code that opens /proc/cpuinfo? We could expose a NeedsDoubleCacheFlush() function and call that in ExecutableAllocator.h.

(We should move the cache flush code for each platform into Architecture-*.cpp/h, the code in ExecutableAllocator.h is becoming quite unwieldy, but that's not necessary for this bug.)
Attachment #8832716 - Flags: review?(jdemooij)
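[Editorial note: a sketch of the reviewer's suggestion above, not the landed patch: detect the affected SoC once (e.g. from InitARMFlags(), which already parses /proc/cpuinfo) and expose a NeedsDoubleCacheFlush() predicate for the cache-flush path to consult. The exact /proc/cpuinfo field and string matched here are assumptions for illustration.]

#include <cstdio>
#include <cstring>

static bool sNeedsDoubleCacheFlush = false;

// Called once at startup; scans /proc/cpuinfo for the affected chipset.
static void DetectDoubleCacheFlushBug() {
    FILE* fp = fopen("/proc/cpuinfo", "r");
    if (!fp)
        return;
    char line[512];
    while (fgets(line, sizeof(line), fp)) {
        // Assumed marker for the Exynos 7420 (EU Galaxy S6); the real check
        // may match a different field or string.
        if (strncmp(line, "Hardware", 8) == 0 &&
            (strstr(line, "EXYNOS7420") || strstr(line, "exynos7420"))) {
            sNeedsDoubleCacheFlush = true;
            break;
        }
    }
    fclose(fp);
}

bool NeedsDoubleCacheFlush() {
    return sNeedsDoubleCacheFlush;
}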
My fault. I tried this, but included the wrong file and as a result thought we had cyclic dependencies.
Attachment #8832716 - Attachment is obsolete: true
Attachment #8832983 - Flags: review?(jdemooij)
Comment on attachment 8832983 [details] [diff] [review]
Workaround by doing two cacheflushes

Review of attachment 8832983 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay. Will be interesting to see how it affects crash stats :)

::: js/src/jit/ExecutableAllocator.cpp
@@ +30,5 @@
>  #include "mozilla/Atomics.h"
>
> +#ifdef JS_CODEGEN_ARM
> +#include "jit/arm/Architecture-arm.h"
> +#endif

Nit: these lines can be removed I think.

::: js/src/jit/arm/Architecture-arm.cpp
@@ +247,5 @@
>  }
> +
> +    // The exynos7420 cpu (EU galaxy S6 (Note)) has a bug where sometimes
> +    // flushing doesn't invalidate the instruction cache. As a result we force
> +    // it by calling the cacheFlush twice on different start address.

Nit: addresses
Attachment #8832983 - Flags: review?(jdemooij) → review+
Pushed by hv1989@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/44e3e172cb97 Double flush the instruction cache as workaround on the exynos7420 chipset, r=jandem
Pushed by hv1989@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/abfb63650780 Include Architecture-arm in ExecutableAllocator, r=bustage ON CLOSED TREE
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Please request Aurora/Beta approval on this when you get a chance.
Attached patch Beta patch — Splinter Review
Flags: needinfo?(hv1989)
Attachment #8834860 - Flags: review+
Comment on attachment 8834860 [details] [diff] [review]
Beta patch

Approval Request Comment
[Feature/Bug causing the regression]: The JITs on Exynos 7420 CPUs (Galaxy S6)
[User impact if declined]: Crashes at random.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: /
[List of other uplifts needed for the feature/fix]: /
[Is the change risky?]: No
[Why is the change risky/not risky?]: It tests whether this is the specific affected CPU and, if so, flushes the cache twice; i.e. the features involved are already quite well tested. It could add a little slowdown when flushing, but that should not be noticeable.
[String changes made/needed]: /
Attachment #8834860 - Flags: approval-mozilla-beta?
Comment on attachment 8834863 [details] [diff] [review]
Aurora patch

Approval Request Comment
[Feature/Bug causing the regression]: The JITs on Exynos 7420 CPUs (Galaxy S6)
[User impact if declined]: Crashes at random.
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: /
[List of other uplifts needed for the feature/fix]: /
[Is the change risky?]: No
[Why is the change risky/not risky?]: It tests whether this is the specific affected CPU and, if so, flushes the cache twice; i.e. the features involved are already quite well tested. It could add a little slowdown when flushing, but that should not be noticeable.
[String changes made/needed]: /
Attachment #8834863 - Flags: approval-mozilla-aurora?
Comment on attachment 8834863 [details] [diff] [review]
Aurora patch

Fix a crash. Aurora53+.
Attachment #8834863 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8834860 [details] [diff] [review]
Beta patch

Force double cache flush on Galaxy S6 to fix random crashes, beta52+
Attachment #8834860 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
FTR, I think this is the same bug as described here: http://www.mono-project.com/news/2016/09/12/arm64-icache/
Needs rebasing for beta:

patching file js/src/jit/ExecutableAllocator.h
Hunk #1 FAILED at 30
Hunk #2 FAILED at 277
2 out of 2 hunks FAILED -- saving rejects to file js/src/jit/ExecutableAllocator.h.rej
patching file js/src/jit/arm/Architecture-arm.cpp
Hunk #2 FAILED at 190
1 out of 2 hunks FAILED -- saving rejects to file js/src/jit/arm/Architecture-arm.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory
errors during apply, please fix and qrefresh file_1286307.txt
Flags: needinfo?(hv1989)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #49)
> FTR, I think this is the same bug as described here:
> http://www.mono-project.com/news/2016/09/12/arm64-icache/

See comment 24, where I listed that already. Though that bug is about ARM64; we are using ARM32.
Flags: needinfo?(hv1989)
Attached patch Rebased beta patch (obsolete) — Splinter Review
Attachment #8835973 - Flags: review+
Attached file .rej file
Still having problems applying it, also with a new clone. Attaching js/src/jit/ExecutableAllocator.h.rej.
Flags: needinfo?(hv1989)
Attached patch Rebased beta patch (obsolete) — Splinter Review
Attachment #8835973 - Attachment is obsolete: true
Flags: needinfo?(hv1989)
Attachment #8835998 - Attachment is obsolete: true
Attachment #8836003 - Flags: review+
Setting qe-verify- per comment 45.
Flags: qe-verify-
Preliminary data shows this decreased the crash rate by 50%. We only have 2 days of data on beta, though. I'm keeping an eye on the graphs.
Re-opening. Looking now, it hasn't changed the crash rate on beta. Looking again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(hv1989)
Noting that this is still the #2 top crash on release 53.
Whiteboard: [#jsapi:crashes-retriage]
This still seems to be high-volume. Can anything else be done here?
Flags: needinfo?(hv1989) → needinfo?(jdemooij)
Assignee: hv1989 → nobody
Looking at 62.0.3 mobile crashes, any process, I see the first dalvik-main crash at #68 and it seems to be improving: https://crash-stats.mozilla.com/topcrashers/?product=FennecAndroid&version=62.0.3&_facets_size=200&process_type=any

Given this and the fact that our workarounds so far have failed, I don't think this is worth spending a lot of time on right now.
Flags: needinfo?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #63)
> Given this and the fact our workarounds so far have failed, I don't think
> this is worth spending a lot of time on right now.

Marking this bug as stalled until we find a better avenue for investigating this issue.

Folding this bug into bug 1461724, where we see the same sort of issue for two more big.LITTLE processors, this time developed by Qualcomm. The Exynos 7420 crashes still remain. The crashes apply to both arm32 and arm64 builds.

Status: REOPENED → RESOLVED
Closed: 8 years ago → 6 years ago
Resolution: --- → DUPLICATE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
