Closed Bug 1450793 Opened 3 years ago Closed 3 years ago

Kernel changes on some devices are causing a huge crash spike (especially on S8)

Categories

(Firefox for Android :: General, defect)

Firefox 59
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 61
Tracking Status
firefox59 + wontfix
firefox60 + fixed
firefox61 + fixed

People

(Reporter: lizzard, Assigned: snorp)

References

Details

Attachments

(1 file, 1 obsolete file)

This crash is showing up in high volume on the Play Store crash records but not on crash-stats, for Android 8.0 and up, on several different Galaxy S8 devices:


signal 11 (SIGSEGV), code 2 (SEGV_ACCERR)
in libmozglue.so

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** pid: 0, tid: 0 >>> org.mozilla.firefox <<< backtrace:

#00 pc 0000000000043d66 /data/app/org.mozilla.firefox-cEa3LWSWhqBfUNxUk_l7Jw==/lib/arm/libmozglue.so

These are the affected devices:

dreamqlteue
dream2qlteue
SCV36
dreamqltecan
dream2qltecan
And the Note8 (greatqlte) as well.
snorp, michael, any ideas for next steps?
Flags: needinfo?(snorp)
Flags: needinfo?(michael.l.comella)
Tom, I noticed we are only testing releases with Android 7 (https://wiki.mozilla.org/QA/Fennec/59/0.2/RC). Should we start testing 8.0 as well, for Firefox 60?
Flags: needinfo?(tgrabowski)
I got the dbg info from https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/libmozglue.so/823E205F9225ED5562448724797BC62C0/libmozglue.so.dbg.gz and addr2line returns:

addr2line -f -C -e libmozglue.so.dbg 0x43d66
ElfLoader::DebuggerHelper::Add(ElfLoader::link_map*)
/builds/worker/workspace/build/src/mozglue/linker/ElfLoader.cpp:925

So the culprit is https://hg.mozilla.org/releases/mozilla-release/annotate/tip/mozglue/linker/ElfLoader.cpp#l925.
:glandium, could you investigate please ?
Flags: needinfo?(mh+mozilla)
From GP, there are 225149 crashes in 59.0.2 and 180234 in 59.0.1.
For 95% of the crashes, android version is 8.0 and for 85% of the crashes the device is a Galaxy S8.
Without a full crash report I can't say anything. And I don't have a device where I could put Android 8.0 on to reproduce.
Flags: needinfo?(mh+mozilla)
Perhaps https://bit.ly/2GMljCL may be the crash, as most of the devices in that list are Samsung S8 devices running API26. Currently that signature is #21 top crash in release.
Not the same crash at all unfortunately.
Mike I have a Galaxy S8 with Oreo, let me know how I can help produce a crash report.
Flags: needinfo?(mh+mozilla)
The first thing would be to have a complete copy of the crash report emitted by Android itself, which normally appears in logcat.
Flags: needinfo?(mh+mozilla) → needinfo?(sdaswani)
I don't have any STRs though?
Flags: needinfo?(sdaswani) → needinfo?(mh+mozilla)
(In reply to :sdaswani from comment #10)
> I don't have any STRs though?

I think if you are on the previous OS and update to Oreo you should get the crash. That's what many of the Google Play Store comments mention.
Ioana is going to try to find STR tomorrow.
Flags: needinfo?(ioana.chiorean)
Flags: needinfo?(mh+mozilla)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #2)
> Tom, I noticed we are only testing releases with Android 7
> (https://wiki.mozilla.org/QA/Fennec/59/0.2/RC). Should we start testing 8.0
> as well, for Firefox 60?

It's best to ask No-Jun, since it's his team testing it.
Flags: needinfo?(tgrabowski) → needinfo?(nojun.park)
(In reply to Tom Grabowski [:TomGrab] from comment #13)
> (In reply to Liz Henry (:lizzard) (needinfo? me) from comment #2)
> > Tom, I noticed we are only testing releases with Android 7
> > (https://wiki.mozilla.org/QA/Fennec/59/0.2/RC). Should we start testing 8.0
> > as well, for Firefox 60?
> 
> It's best to ask No-Jun, since it's his team testing it.


Hi Liz, Tom, No-Jun
The link in the example was the second respin; only regression/light testing was done on it, as it had previously been tested thoroughly.

Most of the time we test several other builds too (we rotate devices/operating systems/manufacturers):
https://wiki.mozilla.org/QA/Fennec/60/Beta/3/smoketests-results
https://wiki.mozilla.org/QA/Fennec/60/Beta/5/smoketests-results
https://wiki.mozilla.org/QA/Fennec/59/RC3 - the original release candidate
etc. 

We did try it today with the S8 and will comment at the end of the day with results.
Flags: needinfo?(ioana.chiorean) → needinfo?(bogdan.surd)
Device:
-Samsung Galaxy S8 with Android 7.0 updated to Android 8.0

Hello,

I installed Nightly, Beta and Release from the Play Store; after browsing for a while, I went on to update my Samsung device from Android N to O.

I ran light smoke tests on the device, as well as accessing some of the pages mentioned by Kevin in the mail, on each of the builds, and only managed to get one crash in Nightly, without any clear or reliable STR.

Here is the crash report from Nightly: https://crash-stats.mozilla.com/report/index/3cd969d4-3cd4-45e2-b056-0d4c60180405

Will continue testing and post anything else of relevance that I might find.
Flags: needinfo?(bogdan.surd)
I think we should test as in comment 11. The crash seems to be with the update from 7.x to 8.0. 
The URLs Kevin mentioned aren't necessarily associated with this crash.
Should this be linked to the meta bug 1450105?
Bogdan, do you have more details on the hardware? 

It looks like there are many different model numbers and variants for the S8: http://www.teamandroid.com/2017/04/13/samsung-galaxy-s8-model-numbers-variants/   

But only a few code names are affected (listed in comment 0).
So I wonder if we can come up with the code name / model number and figure that out. It could be that the hardware you're testing isn't one of the affected ones.
Flags: needinfo?(bogdan.surd)
I would bet that is our issue.  Here are the affected model numbers. They probably use a different chip, or something.

U.S. S8 SM-G950U1 (dreamqlteue)
U.S. S8 plus SM-G955U1 (dream2qlteue)
Canadian S8 SM-G950W (dreamqltecan)
Canadian S8+ SM-G955W (dream2qltecan)
Australian Galaxy S8 SCV36

Andreas, Kevin, maybe we should move forward to get one of the US or Canadian devices for testing this.
Flags: needinfo?(kbrosnan)
Flags: needinfo?(abovens)
Certainly. Let's do it.
Flags: needinfo?(abovens)
Yes. The US/Canada models use a Qualcomm SOC and the other variants use Samsung's Exynos SOC. I'll see what I can do about picking one up, likely best to buy -> expense over SN.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #18)
> Bogdan, do you have more details on the hardware? 

I have the SM-G950F model. Don't have any other model on hand to test this with.

(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #16)
> I think we should test as in comment 11. The crash seems to be with the
> update from 7.x to 8.0. 
> The urls Kevin mentioned aren't necessarily associated with this crash.

This is exactly how I tested the issue, by upgrading from Android 7 to Android 8. Sadly everything worked as expected for my device model on all versions of FF.
Flags: needinfo?(bogdan.surd)
The symbols indicate this is a platform crash so unfortunately I don't think I have the skill set to help here – Snorp might have more info.

> Should this be linked to the meta bug 1450105?

Those bugs seem to be front-end Android resource corruption, while this one is not (it's a platform crash).

fwiw, Susheel may have one of the devices specified in comment 19.
Flags: needinfo?(michael.l.comella)
Thanks Mike. Since this seems like a platform issue we should let Snorp & Co. take it from here.
Duplicate of this bug: 1453325
See Also: → 1453326
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #0)
> This crash is showing up in high volume on the Play Store crash records but
> not on crash-stats, for Android 8.0 and up, on several different Galaxy S8
> devices:
> 
> 
> signal 11 (SIGSEGV), code 2 (SEGV_ACCERR)
> in libmozglue.so
> 
> *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** pid: 0, tid:
> 0 >>> org.mozilla.firefox <<< backtrace:
> 
> #00 pc 0000000000043d66
> /data/app/org.mozilla.firefox-cEa3LWSWhqBfUNxUk_l7Jw==/lib/arm/libmozglue.so

Is this the entirety of the info that Google Play provides for this crash?
(In reply to Calixte Denizet (:calixte) from comment #3)
> So the culprit is
> https://hg.mozilla.org/releases/mozilla-release/annotate/tip/mozglue/linker/
> ElfLoader.cpp#l925.

Given the fact that the line above that line is `EnsureWritable w(&dbg->r_map->l_prev);`, and the crash subcode is `SEGV_ACCERR`, it sounds like `EnsureWritable` is failing here. That method doesn't check the return value of `mprotect`:
https://hg.mozilla.org/releases/mozilla-release/annotate/tip/mozglue/linker/ElfLoader.cpp#l848

...although there's probably nothing useful to be done if that fails except crash anyway.
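A minimal sketch of the kind of guard being discussed here (hypothetical names; not the actual mozglue `EnsureWritable` code), assuming a POSIX/Linux environment:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Make the page range containing [ptr, ptr+length) writable, checking the
// mprotect() return value instead of assuming success. An unchecked failure
// here is exactly what would later surface as SIGSEGV/SEGV_ACCERR when the
// still-read-only memory is written to.
static bool MakeWritable(void* ptr, size_t length) {
  const size_t pagesize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const uintptr_t addr = reinterpret_cast<uintptr_t>(ptr);
  const uintptr_t first = addr & ~(pagesize - 1);  // round down to page start
  const uintptr_t last =
      (addr + length + pagesize - 1) & ~(pagesize - 1);  // round up
  if (mprotect(reinterpret_cast<void*>(first), last - first,
               PROT_READ | PROT_WRITE) != 0) {
    // Log the arguments and errno so a failure is diagnosable from logcat.
    fprintf(stderr, "mprotect(%p, %zu, PROT_READ|PROT_WRITE) failed: %s\n",
            reinterpret_cast<void*>(first), last - first, strerror(errno));
    return false;
  }
  return true;
}
```

Note that `mprotect` requires a page-aligned address, so the rounding step matters: with a wrong page-size assumption the call can fail with EINVAL even when the permissions change itself would be allowed.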
Andreas can you see if we can get more info from your G contact per comment 26?
Flags: needinfo?(abovens)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #26)
> Is this the entirety of the info that Google Play provides for this crash?

It's the only thing we have for this crash in the reports I read. There seems to be no way to search the crash reports on GP (nothing similar to Socorro).
I filed https://issuetracker.google.com/issues/77861779 but we don't have great info to put there yet. As soon as we have info from logcat, I think we should share it there.
(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #28)
> Andreas can you see if we can get more info from your G contact per comment
> 26?

I've contacted them.
Comment on attachment 8967190 [details]
Bug 1450793 - Guard against mprotect() failure when manipulating link map

https://reviewboard.mozilla.org/r/235848/#review241644

::: mozglue/linker/ElfLoader.cpp:992
(Diff revision 1)
>      /* When adding a library for the first time, r_map points to data
>       * handled by the system linker, and that data may be read-only */
>      EnsureWritable w(&dbg->r_map->l_prev);
> +    if (!w.IsWritable()) {
> +      writeDisabled = true;
> +      WARN("Unable to add library to debug list");

The distinction between "failed to add" and "failed to remove" is kind of useless, and you might as well move the logging to EnsureWritable itself. You should also just use one bool attached to the DebuggerHelper class instead of having one static per function, which both will end up with the same value, most likely.

Also, just mostly silently ignoring the error is not going to help us in the long run. We should at the very least log *why* mprotect failed, logging the values it's passed, and the errno it returned. Even better would be getting that information somehow.

With all that being said, I'm wondering if the devices where this is happening have a non-4k page size. Do we have that information in the crash reports?
Attachment #8967190 - Flags: review?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #33)
> Comment on attachment 8967190 [details]
> Bug 1450793 - Guard against mprotect() failure when manipulating link map
> 
> https://reviewboard.mozilla.org/r/235848/#review241644
> 
> ::: mozglue/linker/ElfLoader.cpp:992
> (Diff revision 1)
> >      /* When adding a library for the first time, r_map points to data
> >       * handled by the system linker, and that data may be read-only */
> >      EnsureWritable w(&dbg->r_map->l_prev);
> > +    if (!w.IsWritable()) {
> > +      writeDisabled = true;
> > +      WARN("Unable to add library to debug list");
> 
> The distinction between "failed to add" and "failed to remove" is kind of
> useless, and you might as well move the logging to EnsureWritable itself.

The problem with this is that we don't have adequate context to generate a sane error message. All we know is that mprotect failed (and why), but not the reason we were trying to mprotect in the first place. "Failed to make 0xdeadbeef writable" is not useful.

> You should also just use one bool attached to the DebuggerHelper class
> instead of having one static per function, which both will end up with the
> same value, most likely.

Sure.

> 
> Also, just mostly silently ignoring the error is not going to help us in the
> long run. We should at the very least log *why* mprotect failed, logging
> the values it's passed, and the errno it returned. Even better would be
> getting that information somehow.

Yeah, just trying to stop the bleeding for now, but I'll add that.

> 
> With all that being said, I'm wondering if the devices where this is
> happening have a non-4k page size. Do we have that information in the crash
> reports?

We don't have the page size anywhere AFAICT, but it appears[0] that 4k, 16k, and 64k are all possible on arm64. We can query this via sysconf(), so I'll add another patch that does this and uses the correct alignment.

[0] https://github.com/torvalds/linux/blob/master/arch/arm64/Kconfig#L264
Flags: needinfo?(snorp)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #34)
> (In reply to Mike Hommey [:glandium] from comment #33)
> > The distinction between "failed to add" and "failed to remove" is kind of
> > useless, and you might as well move the logging to EnsureWritable itself.
> 
> The problem with this is that we don't have adequate context to generate a
> sane error message. All we know is that mprotect failed (and why), but not
> the reason we were trying to mprotect in the first place. "Failed to make
> 0xdeadbeef writable" is not useful.

There are only three places where EnsureWritable is used: two are the ones you changed, and it doesn't really matter which is which because in the end they touch the same thing anyway; the third is the diversion of sigaction. Any of those failing is about as problematic. I'm not really convinced that knowing exactly which one failed makes a significant difference, especially considering they really shouldn't be failing in the first place. Actually, looking at the code, it turns out the third never happens anyway, ever since bug 1376704... So I guess the last paragraph of the commit message for that bug was being too cautious, and we can remove all that stuff now.

> > With all that being said, I'm wondering if the devices where this is
> > happening have a non-4k page size. Do we have that information in the crash
> > reports?
> 
> We don't have the page size anywhere AFAICT, but it appears[0] that 4k, 16k,
> and 64k are all possible on arm64. We can query this via sysconf(), so I'll
> add another that does this and uses the correct alignment.
> 
> [0] https://github.com/torvalds/linux/blob/master/arch/arm64/Kconfig#L264

That's why I'm asking what the real page size was. A different page size is about the only thing that would make sense to cause an error. If it's not that, then I can only assume Android 8 has some changes that over-protect some pages.
Comment on attachment 8967190 [details]
Bug 1450793 - Guard against mprotect() failure when manipulating link map

https://reviewboard.mozilla.org/r/235848/#review241666

::: mozglue/linker/ElfLoader.cpp:908
(Diff revision 2)
>        return;
>  
>      page = firstPage;
> -    mprotect(page, length, prot | PROT_WRITE);
> +    success = (mprotect(page, length, prot | PROT_WRITE) == 0);
> +    if (!success) {
> +      WARN("Failed to make %p - 0x%x writable: %s\n", ptr,

The ptr..ptr+length range is not what we're trying to make writable; page..page+length is. prot would be a good thing to log too. Overall, I'd just advise logging strace-style.
Attachment #8967190 - Flags: review?(mh+mozilla)
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

https://reviewboard.mozilla.org/r/235942/#review241668

::: mozglue/linker/Utils.h:110
(Diff revision 2)
>  /**
>   * Page alignment helpers
>   */
> -static inline size_t PageSize()
> +static size_t PageSize()
>  {
> -  return 4096;
> +  static size_t pgsz = 0;

not really great to have a static variable in a static function, which means there's going to be an instance of the static variable in each compilation unit the header is included in.

Might as well use a global atomic variable.
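A sketch of the shape being suggested (using `std::atomic` rather than `mozilla::Atomic`; names are illustrative, not the landed patch):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unistd.h>

// One global definition in a single .cpp file; the header would only declare
// it (`extern std::atomic<size_t> gPageSize;`), avoiding a separate static
// instance in every compilation unit that includes the header.
std::atomic<size_t> gPageSize{0};

static inline size_t PageSize() {
  size_t sz = gPageSize.load(std::memory_order_acquire);
  if (sz == 0) {
    // Benign race: two threads may both call sysconf(), but they store the
    // same value. No longer hard-coded to 4096 -- arm64 kernels can be
    // configured for 4k, 16k, or 64k pages.
    sz = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    gPageSize.store(sz, std::memory_order_release);
  }
  return sz;
}

static inline uintptr_t PageAlignedAddr(uintptr_t addr) {
  return addr & ~(PageSize() - 1);  // round down to a page boundary
}
```

The first caller pays the `sysconf()` cost; every subsequent call is a single atomic load, which keeps the helper cheap enough for hot linker paths.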
Attachment #8967240 - Flags: review?(mh+mozilla)
I'm thinking, doing both changes at the same time will undoubtedly fix the issue... without giving us any insight wrt which of the two patches actually fixed it... and that doesn't sound great. How about landing the page size patch first, and see how things go from there?
Attachment #8967190 - Attachment is obsolete: true
(In reply to Mike Hommey [:glandium] from comment #42)
> I'm thinking, doing both changes at the same time will undoubtedly fix the
> issue... without giving us any insight wrt which of the two patches actually
> fixed it... and that doesn't sound great. How about landing the page size
> patch first, and see how things go from there?

Yup, fine with me.
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

https://reviewboard.mozilla.org/r/235942/#review242002

::: mozglue/linker/Utils.h:106
(Diff revision 3)
>    static FILE *empty() { return nullptr; }
>    static void release(FILE *f) { if (f) fclose(f); }
>  };
>  typedef mozilla::Scoped<AutoCloseFILETraits> AutoCloseFILE;
>  
> +extern mozilla::Atomic<size_t> gPageSize;

Atomic<size_t, ReleaseAcquire>

::: mozglue/linker/Utils.cpp:7
(Diff revision 3)
> + * License, v. 2.0. If a copy of the MPL was not distributed with this file,
> + * You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#include "Utils.h"
> +
> +mozilla::Atomic<size_t> gPageSize;

You could put this in ElfLoader.cpp.
Attachment #8967240 - Flags: review?(mh+mozilla) → review+
Tracking this for 60 as well.
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

Approval Request Comment
[Feature/Bug causing the regression]: None, caused by kernel changes on some devices
[User impact if declined]: Random startup crashes
[Is this code covered by automated tests?]: Yes
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Will use the same page size as old code in the vast majority of cases.
[String changes made/needed]: None
Attachment #8967240 - Flags: approval-mozilla-release?
Attachment #8967240 - Flags: approval-mozilla-beta?
Assignee: nobody → snorp
Summary: Top crash by far on 59.0 Android builds → Kernel changes on some devices are causing a huge crash spike (especially on S8)
https://hg.mozilla.org/mozilla-central/rev/e73f2e16ec41
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 61
Comment on attachment 8967240 [details]
Bug 1450793 - Don't assume 4k page size

Approved for 60.0b13. Let's hope it helps!
Attachment #8967240 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(abovens)
I think I've had some success reproducing this. Crashes within a few seconds of opening a new tab, often when Firefox has been idle for a while.
Flags: needinfo?(kbrosnan)
We only have a little early data from beta 13 on the Play Store, but that early feedback shows the crash rate still just as high (around 4.2% for Android 8.0).
we're about to build 60rc, too late for 59
Flags: needinfo?(nojun.park)
Attachment #8967240 - Flags: approval-mozilla-release? → approval-mozilla-release-
A user at https://old.reddit.com/r/firefox/comments/8g5tpa/firefox_crash_oom/ is reporting random OOM crashes on an S8 with Oreo.
Reopening this issue since the high crash volume still exists. Or, we can open a new meta bug for the problem if people prefer that. 

Attempted fixes: bug 1455662, bug 1460989, as well as the patches in this bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Tracking bug 1460989 instead since we're backing patches out from there.
Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.