Closed
Bug 1450793
Opened 5 years ago
Closed 5 years ago
Kernel changes on some devices are causing a huge crash spike (especially on S8)
Categories
(Firefox for Android Graveyard :: General, defect)
Tracking
(firefox59+ wontfix, firefox60+ fixed, firefox61+ fixed)
RESOLVED
FIXED
Firefox 61
People
(Reporter: lizzard, Assigned: snorp)
References
Details
Attachments
(1 file, 1 obsolete file)
59 bytes, text/x-review-board-request
glandium: review+
RyanVM: approval-mozilla-beta+
jcristau: approval-mozilla-release-
Details
This crash is showing up in high volume on the Play Store crash records but not on crash-stats, for Android 8.0 and up, on several different Galaxy S8 devices:

signal 11 (SIGSEGV), code 2 (SEGV_ACCERR) in libmozglue.so
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
pid: 0, tid: 0 >>> org.mozilla.firefox <<<
backtrace:
#00 pc 0000000000043d66 /data/app/org.mozilla.firefox-cEa3LWSWhqBfUNxUk_l7Jw==/lib/arm/libmozglue.so

These are the affected devices:
dreamqlteue
dream2qlteue
SCV36
dreamqltecan
dream2qltecan
And the Note8 (greatqlte) as well.
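For context (not part of the crash report itself): SEGV_ACCERR means the faulting address was mapped but the access violated the page protection, as opposed to SEGV_MAPERR, where the address is not mapped at all. A minimal C++ sketch that produces this kind of fault by writing to a read-only page:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Map a single read-only page; writing to it raises SIGSEGV with
  // si_code == SEGV_ACCERR, matching the signal/code pair reported above.
  size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* mem = mmap(nullptr, pageSize, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) {
    perror("mmap");
    return 1;
  }
  char* p = static_cast<char*>(mem);
  p[0] = 1;  // faults here: the page is mapped, but not writable
  return 0;
}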
Reporter
Comment 1•5 years ago
snorp, michael, any ideas for next steps?
status-firefox59: --- → affected
tracking-firefox59: --- → +
Flags: needinfo?(snorp)
Flags: needinfo?(michael.l.comella)
Reporter
Comment 2•5 years ago
Tom, I noticed we are only testing releases with Android 7 (https://wiki.mozilla.org/QA/Fennec/59/0.2/RC). Should we start testing 8.0 as well, for Firefox 60?
Flags: needinfo?(tgrabowski)
Comment 3•5 years ago
I got the debug info from https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/libmozglue.so/823E205F9225ED5562448724797BC62C0/libmozglue.so.dbg.gz and addr2line returns:

addr2line -f -C -e libmozglue.so.dbg 0x43d66
ElfLoader::DebuggerHelper::Add(ElfLoader::link_map*)
/builds/worker/workspace/build/src/mozglue/linker/ElfLoader.cpp:925

So the culprit is https://hg.mozilla.org/releases/mozilla-release/annotate/tip/mozglue/linker/ElfLoader.cpp#l925. :glandium, could you investigate please?
Flags: needinfo?(mh+mozilla)
Comment 4•5 years ago
From GP (Google Play), there are 225149 crashes in 59.0.2 and 180234 in 59.0.1. For 95% of the crashes the Android version is 8.0, and for 85% of the crashes the device is a Galaxy S8.
Comment 5•5 years ago
Without a full crash report I can't say anything. And I don't have a device that I could put Android 8.0 on to reproduce.
Flags: needinfo?(mh+mozilla)
Comment 6•5 years ago
Perhaps https://bit.ly/2GMljCL is the crash, as most of the devices in that list are Samsung S8 devices running API 26. Currently that signature is the #21 top crash in release.
Comment 7•5 years ago
Not the same crash at all, unfortunately.
Comment 8•5 years ago
Mike, I have a Galaxy S8 with Oreo, let me know how I can help produce a crash report.
Flags: needinfo?(mh+mozilla)
Comment 9•5 years ago
The first thing would be to have a complete copy of the crash report emitted by Android itself, which normally appears in logcat.
Flags: needinfo?(mh+mozilla) → needinfo?(sdaswani)
Comment 10•5 years ago
I don't have any STRs though?
Flags: needinfo?(sdaswani) → needinfo?(mh+mozilla)
Comment 11•5 years ago
(In reply to :sdaswani from comment #10)
> I don't have any STRs though?

I think if you are on the previous OS and update to Oreo you should get the crash. That's what many of the Google Play Store comments mention.
Updated•5 years ago
Flags: needinfo?(mh+mozilla)
Comment 13•5 years ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #2)
> Tom, I noticed we are only testing releases with Android 7
> (https://wiki.mozilla.org/QA/Fennec/59/0.2/RC). Should we start testing 8.0
> as well, for Firefox 60?

It's best to ask No-Jun, since it's his team testing it.
Flags: needinfo?(tgrabowski) → needinfo?(nojun.park)
Comment 14•5 years ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #2)
> Tom, I noticed we are only testing releases with Android 7
> (https://wiki.mozilla.org/QA/Fennec/59/0.2/RC). Should we start testing 8.0
> as well, for Firefox 60?

(In reply to Tom Grabowski [:TomGrab] from comment #13)
> It's best to ask No-Jun, since it's his team testing it.

Hi Liz, Tom, No-Jun. The link in the example was the second respin, and only regression/light testing was done on it, as the original candidate had already been tested thoroughly. Most of the time we test several other builds too (we rotate devices, operating systems, and manufacturers):
https://wiki.mozilla.org/QA/Fennec/60/Beta/3/smoketests-results
https://wiki.mozilla.org/QA/Fennec/60/Beta/5/smoketests-results
https://wiki.mozilla.org/QA/Fennec/59/RC3 - the original release candidate
etc.

We did try it today with the S8 and will comment at the end of the day with results.
Flags: needinfo?(ioana.chiorean) → needinfo?(bogdan.surd)
Comment 15•5 years ago
Device: Samsung Galaxy S8 with Android 7.0 updated to Android 8.0

Hello, I installed Nightly, Beta and Release from the Play Store, and after browsing for a while I updated my Samsung device from Android N to O. I ran light smoke tests on the device, as well as accessing some of the pages mentioned by Kevin in the mail, on each of the builds, and only managed to get one crash in Nightly, without any clear or reliable STR. Here is the crash report from Nightly: https://crash-stats.mozilla.com/report/index/3cd969d4-3cd4-45e2-b056-0d4c60180405

I will continue testing and post anything else of relevance that I might find.
Flags: needinfo?(bogdan.surd)
Reporter
Comment 16•5 years ago
I think we should test as in comment 11. The crash seems to be with the update from 7.x to 8.0. The urls Kevin mentioned aren't necessarily associated with this crash.
Reporter
Comment 17•5 years ago
Should this be linked to the meta bug 1450105?
Reporter
Comment 18•5 years ago
Bogdan, do you have more details on the hardware? It looks like there are many different model numbers and variants for the S8: http://www.teamandroid.com/2017/04/13/samsung-galaxy-s8-model-numbers-variants/ But only a few code names are affected (listed in comment 0). So I wonder if we can come up with the code name / model number and figure that out. It could be that the hardware you're testing isn't one of the affected ones.
Flags: needinfo?(bogdan.surd)
Reporter
Comment 19•5 years ago
I would bet that is our issue. Here are the affected model numbers; they probably use a different chip, or something.

U.S. S8: SM-G950U1 (dreamqlteue)
U.S. S8 plus: SM-G955U1 (dream2qlteue)
Canadian S8: SM-G950W (dreamqltecan)
Canadian S8+: SM-G955W (dream2qltecan)
Australian Galaxy S8: SCV36

Andreas, Kevin, maybe we should move forward to get one of the US or Canadian devices for testing this.
Flags: needinfo?(kbrosnan)
Flags: needinfo?(abovens)
Comment 21•5 years ago
Yes. The US/Canada models use a Qualcomm SoC and the other variants use Samsung's Exynos SoC. I'll see what I can do about picking one up; likely best to buy one and expense it over SN.
Comment 22•5 years ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #18)
> Bogdan, do you have more details on the hardware?

I have the SM-G950F model. I don't have any other model on hand to test this with.

(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #16)
> I think we should test as in comment 11. The crash seems to be with the
> update from 7.x to 8.0. The urls Kevin mentioned aren't necessarily
> associated with this crash.

This is exactly how I tested the issue, by upgrading from Android 7 to Android 8. Sadly, everything worked as expected for my device model on all versions of FF.
Flags: needinfo?(bogdan.surd)
Comment 23•5 years ago
The symbols indicate this is a platform crash, so unfortunately I don't think I have the skill set to help here; Snorp might have more info.

> Should this be linked to the meta bug 1450105?

Those bugs seem to be about front-end Android resource corruption, while this one is a platform crash. FWIW, Susheel may have one of the devices specified in comment 19.
Flags: needinfo?(michael.l.comella)
Comment 24•5 years ago
Thanks Mike. Since this seems like a platform issue we should let Snorp & Co. take it from here.
Comment 26•5 years ago
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #0)
> This crash is showing up in high volume on the Play Store crash records but
> not on crash-stats, for Android 8.0 and up, on several different Galaxy S8
> devices:
>
> signal 11 (SIGSEGV), code 2 (SEGV_ACCERR) in libmozglue.so
> *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
> pid: 0, tid: 0 >>> org.mozilla.firefox <<<
> backtrace:
> #00 pc 0000000000043d66 /data/app/org.mozilla.firefox-cEa3LWSWhqBfUNxUk_l7Jw==/lib/arm/libmozglue.so

Is this the entirety of the info that Google Play provides for this crash?
Comment 27•5 years ago
(In reply to Calixte Denizet (:calixte) from comment #3)
> So the culprit is
> https://hg.mozilla.org/releases/mozilla-release/annotate/tip/mozglue/linker/ElfLoader.cpp#l925.

Given that the line above that one is `EnsureWritable w(&dbg->r_map->l_prev);`, and the crash subcode is `SEGV_ACCERR`, it sounds like `EnsureWritable` is failing here. That method doesn't check the return value of `mprotect`: https://hg.mozilla.org/releases/mozilla-release/annotate/tip/mozglue/linker/ElfLoader.cpp#l848 ...although there's probably nothing useful to be done if that fails except crash anyway.
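To illustrate the pattern being discussed (a sketch under the assumption that the mprotect() call inside EnsureWritable is what fails; TryMakeWritable is a hypothetical helper name, not the mozglue code):

#include <sys/mman.h>
#include <cstddef>

// The existing code calls mprotect() and ignores the result, so a failure
// only shows up later as SEGV_ACCERR when the supposedly-writable page is
// touched. Checking the result lets the caller skip the write instead.
static bool TryMakeWritable(void* page, size_t length, int prot) {
  return mprotect(page, length, prot | PROT_WRITE) == 0;
}

// Caller side (sketch): only modify the debugger link map if the pages
// really became writable.
//   if (!TryMakeWritable(firstPage, length, prot)) {
//     return;  // leave the link map alone rather than fault
//   }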
Comment 28•5 years ago
Andreas can you see if we can get more info from your G contact per comment 26?
Flags: needinfo?(abovens)
Comment 29•5 years ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #26)
> Is this the entirety of the info that Google Play provides for this crash?

It's the only thing we've got for this crash in the reports I read. It seems that there is no way to search in the crash reports on GP (nothing similar to Socorro).
Reporter
Comment 30•5 years ago
I filed https://issuetracker.google.com/issues/77861779 but we don't have great info to put there yet. As soon as we have info from logcat, I think we should share it there.
Comment 31•5 years ago
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #28)
> Andreas can you see if we can get more info from your G contact per comment 26?

I've contacted them.
Comment hidden (mozreview-request)
Comment 33•5 years ago
mozreview-review
Comment on attachment 8967190 [details]
Bug 1450793 - Guard against mprotect() failure when manipulating link map

https://reviewboard.mozilla.org/r/235848/#review241644

::: mozglue/linker/ElfLoader.cpp:992
(Diff revision 1)
>      /* When adding a library for the first time, r_map points to data
>       * handled by the system linker, and that data may be read-only */
>      EnsureWritable w(&dbg->r_map->l_prev);
> +    if (!w.IsWritable()) {
> +      writeDisabled = true;
> +      WARN("Unable to add library to debug list");

The distinction between "failed to add" and "failed to remove" is kind of useless, and you might as well move the logging to EnsureWritable itself. You should also just use one bool attached to the DebuggerHelper class instead of having one static per function; both will most likely end up with the same value.

Also, just mostly silently ignoring the error is not going to help us in the long run. We should at the very least log *why* mprotect failed, logging the values it's passed and the errno it returned. Even better would be getting that information somehow.

With all that being said, I'm wondering if the devices where this is happening have a non-4k page size. Do we have that information in the crash reports?
Attachment #8967190 - Flags: review?(mh+mozilla)
Assignee
Comment 34•5 years ago
(In reply to Mike Hommey [:glandium] from comment #33)
> The distinction between "failed to add" and "failed to remove" is kind of
> useless, and you might as well move the logging to EnsureWritable itself.

The problem with this is that we don't have adequate context to generate a sane error message. All we know is that mprotect failed (and why), but not the reason we were trying to mprotect in the first place. "Failed to make 0xdeadbeef writable" is not useful.

> You should also just use one bool attached to the DebuggerHelper class
> instead of having one static per function; both will most likely end up
> with the same value.

Sure.

> Also, just mostly silently ignoring the error is not going to help us in
> the long run. We should at the very least log *why* mprotect failed,
> logging the values it's passed and the errno it returned. Even better
> would be getting that information somehow.

Yeah, just trying to stop the bleeding for now, but I'll add that.

> With all that being said, I'm wondering if the devices where this is
> happening have a non-4k page size. Do we have that information in the
> crash reports?

We don't have the page size anywhere AFAICT, but it appears[0] that 4k, 16k, and 64k are all possible on arm64. We can query this via sysconf(), so I'll add another patch that does this and uses the correct alignment.

[0] https://github.com/torvalds/linux/blob/master/arch/arm64/Kconfig#L264
Flags: needinfo?(snorp)
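A sketch of what querying the page size via sysconf() and aligning to it can look like. This is illustrative only: PageSize() matches the helper name quoted in comment 41, but the alignment helper names here are assumptions, not the actual patch.

#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Instead of hard-coding 4096, ask the kernel: arm64 kernels can be built
// with 4k, 16k or 64k pages, and the address passed to mprotect() must be
// aligned to whatever the real page size is.
static inline size_t PageSize() {
  return static_cast<size_t>(sysconf(_SC_PAGESIZE));
}

// Round an address down to the start of its page.
static inline uintptr_t PageAlignedPtr(uintptr_t ptr) {
  return ptr & ~(PageSize() - 1);
}

// Round an end address up to the next page boundary.
static inline uintptr_t PageAlignedEndPtr(uintptr_t ptr) {
  return (ptr + PageSize() - 1) & ~(PageSize() - 1);
}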
Comment 35•5 years ago
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #34)
> The problem with this is that we don't have adequate context to generate a
> sane error message. All we know is that mprotect failed (and why), but not
> the reason we were trying to mprotect in the first place. "Failed to make
> 0xdeadbeef writable" is not useful.

There are only three places where EnsureWritable is used. Two are the ones you changed, and it doesn't really matter which is which, because in the end they touch the same thing anyway; the third is the diversion of sigaction. Any of those failing is about as problematic. I'm not really convinced that knowing exactly which failed makes a significant difference, especially considering they really shouldn't be failing in the first place.

Actually, looking at the code, it turns out the third is never happening anyway, ever since bug 1376704... So I guess the last paragraph of the commit message for that bug was being too cautious, and we can remove all that stuff now.

> We don't have the page size anywhere AFAICT, but it appears[0] that 4k, 16k,
> and 64k are all possible on arm64. We can query this via sysconf(), so I'll
> add another patch that does this and uses the correct alignment.

That's why I'm asking what the real page size was. A different page size is about the only thing that would make sense to cause an error. If it's not that, then I can only assume Android 8 has some changes that over-protect some pages.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment 40•5 years ago
mozreview-review
Comment on attachment 8967190 [details]
Bug 1450793 - Guard against mprotect() failure when manipulating link map

https://reviewboard.mozilla.org/r/235848/#review241666

::: mozglue/linker/ElfLoader.cpp:908
(Diff revision 2)
>      return;
>
>    page = firstPage;
> -  mprotect(page, length, prot | PROT_WRITE);
> +  success = (mprotect(page, length, prot | PROT_WRITE) == 0);
> +  if (!success) {
> +    WARN("Failed to make %p - 0x%x writable: %s\n", ptr,

The ptr to ptr+length range is not what we're trying to make writable; page to page+length is. prot would be a good thing to log too. Overall, I'd just advise logging strace-style.
Attachment #8967190 - Flags: review?(mh+mozilla)
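A sketch of the strace-style logging being suggested: report the page-aligned address actually passed to mprotect(), the length, the protection flags, and the errno string. MprotectWithLog is a hypothetical helper name used only for illustration:

#include <sys/mman.h>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Report the failed call the way strace would print it, so the warning
// carries the exact arguments rather than the caller's original pointer.
static bool MprotectWithLog(void* page, size_t length, int prot) {
  if (mprotect(page, length, prot) == 0) {
    return true;
  }
  fprintf(stderr, "mprotect(%p, %zu, 0x%x) = -1 (%s)\n",
          page, length, static_cast<unsigned>(prot), strerror(errno));
  return false;
}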
Comment 41•5 years ago
mozreview-review
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

https://reviewboard.mozilla.org/r/235942/#review241668

::: mozglue/linker/Utils.h:110
(Diff revision 2)
>   /**
>    * Page alignment helpers
>    */
> -static inline size_t PageSize()
> +static size_t PageSize()
>   {
> -  return 4096;
> +  static size_t pgsz = 0;

It's not really great to have a static variable in a static function in a header: there will be an instance of the static variable in each compilation unit the header is included in. Might as well use a global atomic variable.
Attachment #8967240 - Flags: review?(mh+mozilla)
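A sketch of the global-atomic pattern being suggested, using std::atomic as a stand-in for mozilla::Atomic (the actual declaration and memory ordering are settled in comment 45; the caching logic here is an assumption for illustration):

#include <atomic>
#include <cstddef>
#include <unistd.h>

// In the real header this would be `extern`, with the single definition in
// one .cpp file (Utils.cpp or ElfLoader.cpp, per comment 45). Defined here
// so the sketch is self-contained.
std::atomic<size_t> gPageSize{0};

static inline size_t PageSize() {
  size_t sz = gPageSize.load(std::memory_order_acquire);
  if (sz == 0) {
    sz = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    // Benign race: concurrent initializers all store the same value.
    gPageSize.store(sz, std::memory_order_release);
  }
  return sz;
}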
Comment 42•5 years ago
I'm thinking, doing both changes at the same time will undoubtedly fix the issue... without giving us any insight wrt which of the two patches actually fixed it... and that doesn't sound great. How about landing the page size patch first, and see how things go from there?
Comment hidden (mozreview-request)
Assignee
Updated•5 years ago
Attachment #8967190 - Attachment is obsolete: true
Assignee
Comment 44•5 years ago
(In reply to Mike Hommey [:glandium] from comment #42)
> I'm thinking, doing both changes at the same time will undoubtedly fix the
> issue... without giving us any insight wrt which of the two patches actually
> fixed it... and that doesn't sound great. How about landing the page size
> patch first, and see how things go from there?

Yup, fine with me.
Comment 45•5 years ago
mozreview-review
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

https://reviewboard.mozilla.org/r/235942/#review242002

::: mozglue/linker/Utils.h:106
(Diff revision 3)
>   static FILE *empty() { return nullptr; }
>   static void release(FILE *f) { if (f) fclose(f); }
> };
> typedef mozilla::Scoped<AutoCloseFILETraits> AutoCloseFILE;
>
> +extern mozilla::Atomic<size_t> gPageSize;

Atomic<size_t, ReleaseAcquire>

::: mozglue/linker/Utils.cpp:7
(Diff revision 3)
> + * License, v. 2.0. If a copy of the MPL was not distributed with this file,
> + * You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#include "Utils.h"
> +
> +mozilla::Atomic<size_t> gPageSize;

You could put this in ElfLoader.cpp.
Attachment #8967240 - Flags: review?(mh+mozilla) → review+
Comment 46•5 years ago
Pushed by jwillcox@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e73f2e16ec41
Don't assume 4k page size r=glandium
Reporter
Comment 47•5 years ago
Tracking this for 60 as well.
status-firefox60: --- → affected
tracking-firefox60: --- → +
Assignee
Comment 48•5 years ago
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

Approval Request Comment
[Feature/Bug causing the regression]: None, caused by kernel changes on some devices
[User impact if declined]: Random startup crashes
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Will use the same page size as the old code in the vast majority of cases.
[String changes made/needed]: None
Attachment #8967240 - Flags: approval-mozilla-release?
Attachment #8967240 - Flags: approval-mozilla-beta?
Updated•5 years ago
Assignee: nobody → snorp
Updated•5 years ago
Summary: Top crash by far on 59.0 Android builds → Kernel changes on some devices are causing a huge crash spike (especially on S8)
Updated•5 years ago
status-firefox61: --- → affected
tracking-firefox61: --- → +
Comment 49•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e73f2e16ec41
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 61
Comment 50•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e73f2e16ec41
Comment 51•5 years ago
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

Approved for 60.0b13. Let's hope it helps!
Attachment #8967240 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 52•5 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/df8719ae5e6f
Updated•5 years ago
Flags: needinfo?(abovens)
Comment 53•5 years ago
I think I've had some success reproducing this. Crashes within a few seconds of opening a new tab, often when Firefox has been idle for a while.
Flags: needinfo?(kbrosnan)
Reporter
Comment 54•5 years ago
We only have a little bit of early data from beta 13 on the Play Store, but that early feedback is showing the crash rate still just as high (around 4.2% for Android 8.0).
Updated•5 years ago
Attachment #8967240 - Flags: approval-mozilla-release? → approval-mozilla-release-
Comment 56•5 years ago
A user at https://old.reddit.com/r/firefox/comments/8g5tpa/firefox_crash_oom/ is reporting random OOM crashes on an S8 with Oreo.
Reporter
Comment 57•5 years ago
Reopening this issue since the high crash volume still exists. Or, we can open a new meta bug for the problem if people prefer that. Attempted fixes: bug 1455662, bug 1460989, as well as the patches in this bug.
Reporter
Updated•5 years ago
tracking-firefox62: --- → +
Reporter
Comment 58•5 years ago
Tracking bug 1460989 instead since we're backing patches out from there.
Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → FIXED
Updated•5 years ago
status-firefox62: affected → ---
tracking-firefox62: + → ---
Updated•2 years ago
Product: Firefox for Android → Firefox for Android Graveyard