Galaxy S8 SEGV_ACCERR crash in v60+

RESOLVED FIXED in Firefox 61

Status

()

defect
P1
normal
RESOLVED FIXED
Last year
Last year

People

(Reporter: kbrosnan, Assigned: jchen)

Tracking

Trunk
Firefox 62
ARM64
Android
Points:
---

Firefox Tracking Flags

(firefox60+ wontfix, firefox61+ fixed, firefox62+ fixed)

Details

Attachments

(1 attachment)

Reporter

Description

Last year
[Tracking Requested - why for this release]: This crash continues to show up in 60+.

05-11 16:05:47.874 +0000 15126 15126 F DEBUG   : Build fingerprint: 'samsung/dreamqlteue/dreamqlteue:8.0.0/R16NW/G950U1UEU2CRB9:user/release-keys'
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   : Revision: '12'
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   : ABI: 'arm'
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   : pid: 15101, tid: 15113, name: Binder:15101_2  >>> org.mozilla.firefox.PasswordsProvider <<<
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   : signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 0xf4669124
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     r0 f4669114  r1 00001000  r2 c0000000  r3 f466a000
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     r4 d309f91c  r5 d2d0707c  r6 d2d08040  r7 d2d0805c
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     r8 00000000  r9 f22bd0b4  sl d46a261c  fp 00000000
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     ip f085a634  sp d46a25e8  lr f081c515  pc d2f9c2ba  cpsr 000f0030
05-11 16:05:47.876 +0000 15126 15126 F DEBUG   : 
05-11 16:05:47.876 +0000 15126 15126 F DEBUG   : backtrace:
05-11 16:05:47.876 +0000 15126 15126 F DEBUG   :     #00 pc 000142ba  /data/app/org.mozilla.firefox-rV4vVmDoVTl9JHHpgaa4dw==/lib/arm/libmozglue.so (offset 0x3c000)
05-11 16:05:47.876 +0000 15126 15126 F DEBUG   :     #01 pc 00052511  /system/lib/libc.so (fclose+160)
05-11 16:05:47.876 +0000 15126 15126 F DEBUG : #02 pc 00006fff [anon:jemalloc:d2d00000]

Captured a couple logcats https://gist.github.com/kbrosnan/1c26e5ca2e5f58cb2fa7484497aa8a18 and https://gist.github.com/kbrosnan/888a15dfcb6be383fc4348ec73c43cb5
There are a couple versions of this I see in the Google Play console. This first one has 12k reports in the last week and the stack seems to be consistent among the reports I looked at:

signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), __dl__ZN6soinfo19increment_ref_countEv
  #00  pc 0000000000013cd6  /system/bin/linker (__dl__ZN6soinfo19increment_ref_countEv+9)
  #01  pc 0000000000005ef3  /system/bin/linker (__dl__ZL12find_libraryP19android_namespace_tPKciPK17android_dlextinfoP6soinfo+54)
  #02  pc 0000000000005c6f  /system/bin/linker (__dl__Z9do_dlopenPKciPK17android_dlextinfoPKv+1110)
  #03  pc 000000000000343b  /system/bin/linker (__dl__ZL10dlopen_extPKciPK17android_dlextinfoPKv+38)
  #04  pc 000000000000103d  /system/lib/libdl.so (dlopen+4)
  #05  pc 000000000000eddf  /system/lib/libEGL.so (_ZN7androidL12load_wrapperEPKc+46)
  #06  pc 000000000000e437  /system/lib/libEGL.so (_ZN7android6Loader4openEPNS_16egl_connection_tE+286)
  #07  pc 0000000000009c0b  /system/lib/libEGL.so (_ZN7android16egl_init_driversEv+54)
  #08  pc 0000000000009de7  /system/lib/libEGL.so (eglGetDisplay+66)
  #09  pc 00000000000a6acb  /system/lib/libandroid_runtime.so (_ZL24android_eglGetDisplayIntP7_JNIEnvP8_jobjecti+26)
  #10  pc 0000000000c148cb  /system/framework/arm/boot-framework.oat (android.hardware.usb.UsbDevice.native_get_device_name [DEDUPED]+98)
  #11  pc 0000000000e61fb1  /system/framework/arm/boot-framework.oat (android.os.-$Lambda$6x30vPJhBKUfNY8tswxuZo3DCe0.run+56)
  #12  pc 00000000001ee141  /system/framework/arm/boot.oat (java.lang.Thread.run+64)
  #13  pc 00000000003dcee1  /system/lib/libart.so (art_quick_invoke_stub_internal+64)
  #14  pc 00000000003e1455  /system/lib/libart.so (art_quick_invoke_stub+228)
  #15  pc 00000000000ac605  /system/lib/libart.so (_ZN3art9ArtMethod6InvokeEPNS_6ThreadEPjjPNS_6JValueEPKc+140)
  #16  pc 000000000033246d  /system/lib/libart.so (_ZN3artL18InvokeWithArgArrayERKNS_33ScopedObjectAccessAlreadyRunnableEPNS_9ArtMethodEPNS_8ArgArrayEPNS_6JValueEPKc+52)
  #17  pc 00000000003332e1  /system/lib/libart.so (_ZN3art35InvokeVirtualOrInterfaceWithJValuesERKNS_33ScopedObjectAccessAlreadyRunnableEP8_jobjectP10_jmethodIDP6jvalue+320)
  #18  pc 00000000003508f9  /system/lib/libart.so (_ZN3art6Thread14CreateCallbackEPv+892)
  #19  pc 000000000004881f  /system/lib/libc.so (_ZL15__pthread_startPv+22)
  #20  pc 000000000001b2a1  /system/lib/libc.so (__start_thread+32)


There are several variations of this second one which has about 1200 reports in the last week:

signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), libmozglue.so (Galaxy S8 and S9)
  #00  pc 00000000000142ba  /data/app/org.mozilla.firefox-RbXXu2v-RzvP6BI9F1vlIA==/lib/arm/libmozglue.so
  #01  pc 000000000001499b  /data/app/org.mozilla.firefox-RbXXu2v-RzvP6BI9F1vlIA==/lib/arm/libmozglue.so

signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), libmozglue.so (Galaxy S6) 
  #00  pc 000000000001299a  /data/app/org.mozilla.firefox-1/lib/arm/libmozglue.so
  #01  pc 000000000001298f  /data/app/org.mozilla.firefox-1/lib/arm/libmozglue.so

signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), libmozglue.so (Galaxy J7) 
  #00  pc 0000000000012ab2  /data/app/org.mozilla.firefox-1/lib/arm/libmozglue.so
  #01  pc 0000000000012aa7  /data/app/org.mozilla.firefox-1/lib/arm/libmozglue.so

signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), libmozglue.so (Galaxy S8) 
  #00  pc 000000000000fdec  /data/app/org.mozilla.firefox-1/lib/arm/libmozglue.so
  #01  pc 000000000000fd9f  /data/app/org.mozilla.firefox-1/lib/arm/libmozglue.so
The attached logcat shows that we are launching Gecko inside GeckoService, which is probably a good clue. We start up Gecko there for a few reasons[0], but I think the most likely is that we're warming up in GeckoCustomTabsService[1]. Maybe that's racing when the Activity startup?

[0] https://dxr.mozilla.org/mozilla-release/source/mobile/android/base/java/org/mozilla/gecko/GeckoService.java?q=GeckoService.java&redirect_type=direct#197
[1] https://dxr.mozilla.org/mozilla-release/source/mobile/android/base/java/org/mozilla/gecko/customtabs/GeckoCustomTabsService.java#51
Reporter

Comment 4

Last year
This crash happened in the background while using another app https://gist.github.com/kbrosnan/8c9cfc656a7267c45666ae4c70911f03
(In reply to Kevin Brosnan [:kbrosnan] from comment #4)
> This crash happened in the background while using another app
> https://gist.github.com/kbrosnan/8c9cfc656a7267c45666ae4c70911f03

snorp: you can see the Android Application instance doesn't exist so onCreate is getting called:

https://searchfox.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/GeckoApplication.java#265

and that's the log message

05-16 14:42:35.050 16503 16503 I GeckoApplication: zerdatime 452443751 - application start

This is a separate process, mind you, since what's about to happen is that Sync is going to ask for passwords:

05-16 14:42:35.144 16503 16515 E GeckoLibLoad: Load sqlite start
05-16 14:42:35.167 16503 16515 F libc : Fatal signal 11 (SIGSEGV), code 2, fault addr 0xf4669124 in tid 16515 (Binder:16503_2)

For whatever reason, we can't unpack and load the libraries in the PasswordProvider process on these devices.  Maybe it's when Gecko is running already; maybe it's when Gecko is not yet running.  This is all very racy; it's hard to remember that we can have _two copies of the App Java code_ in play.
Good discussion and some hints but still mysterious - maybe Glandium has thoughts?
Flags: needinfo?(mh+mozilla)
Priority: -- → P1
what does addr2line say the corresponding symbols are in libmozglue.so?
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(kbrosnan)
Reporter

Comment 8

Last year
What is the question to me?
Flags: needinfo?(kbrosnan)
Do you have an answer to comment 7, or at least the exact versions of fennec for which we have crash reports/logs?
Reporter

Comment 10

Last year
I don't know what addr2line is. 

There is not a regression range as this bug is the result of Samsung pushing the Android 8.0 release and finding one with ~1/day reproducibility would be quite difficult. See previous issues bug 1455662 and bug 1450793.

Email context from around April 2nd.

Liz Henry: 
From the Google Play interface, our crash rate seems to have gone rocketing up recently with release 59. Most of these crashes are with Android 8.0 on Galaxy 8 devices.  I'm wondering if this is a 8.0 specific bug, and if we have contacts at Samsung to talk to. 

We also might want to start testing on 8.0 and up, if we aren't already. It doesn't look like this crash spike is reflected in crash-stats reports. Any ideas for next steps?  

Me:
It looks like the problem is mostly with the Galaxy S8. About 2 weeks ago Samsung and carriers started pushing out Android 8.0 (Oreo API 26 & 27) to the Samsung Galaxy S8 [1]. There would not be a way to catch this in Firefox beta as those phones were running 7.0 or 7.1 at the time. Getting our hands on an upgraded North American S8 would be useful. 

[1] https://news.google.com/news/search/section/q/galaxy%20s8%20oreo/galaxy%20s8%20oreo?hl=en&gl=US&ned=us
Reporter

Comment 11

Last year
The logs I have attached are from 60.0 or 60.0.1 ARMv7 API 16 build. The timestamps from the logs should be accurate.
(In reply to Kevin Brosnan [:kbrosnan] from comment #11)
> The logs I have attached are from 60.0 or 60.0.1 ARMv7 API 16 build. The
> timestamps from the logs should be accurate.

Do you know which is which?
Reporter

Comment 13

Last year
From looking at archive.mozilla.org they all predate the 60.0.1 release. Builds did not exist until 16-May-2018 00:56 and were pushed to the release folder around 16-May-2018 17:56.
So, the crash is at:
05-11 16:05:47.876 +0000 15126 15126 F DEBUG : #00 pc 000142ba /data/app/org.mozilla.firefox-rV4vVmDoVTl9JHHpgaa4dw==/lib/arm/libmozglue.so (offset 0x3c000)

which would correspond to this:
   502ae:       b570            push    {r4, r5, r6, lr}
   502b0:       b086            sub     sp, #24
   502b2:       4604            mov     r4, r0
   502b4:       460d            mov     r5, r1
   502b6:       e9d4 0100       ldrd    r0, r1, [r4]
-->502ba:       1a08            subs    r0, r1, r0
   502bc:       2101            movs    r1, #1
   502be:       eb01 01a0       add.w   r1, r1, r0, asr #2
   502c2:       4620            mov     r0, r4

And we have the following register values and crash address:
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   : signal 11 (SIGSEGV), code 2 (SEGV_ACCERR), fault addr 0xf4669124
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     r0 f4669114  r1 00001000  r2 c0000000  r3 f466a000
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     r4 d309f91c  r5 d2d0707c  r6 d2d08040  r7 d2d0805c
05-11 16:05:47.875 +0000 15126 15126 F DEBUG   :     r8 00000000  r9 f22bd0b4  sl d46a261c  fp 00000000
05-11 16:05:47.875 +0000 15126 15126 F DEBUG : ip f085a634 sp d46a25e8 lr f081c515 pc d2f9c2ba cpsr 000f0030

... which doesn't seem to be plausible. Even assuming the actual crash is on the deref of r4 on the previous instruction, then the fault address should be at 0xd309f91c (or close), not 0xf4669124.

If the crash is really happening at this location (0x502ba), that's in std::__ndk1::vector<const elf32_phdr *, std::__ndk1::allocator<const elf32_phdr *> >::__push_back_slow_path<const elf32_phdr *const &>, which disassembly says is only called from CustomElf::Load here:
https://dxr.mozilla.org/mozilla-central/rev/51f2535c797495f1f4e864072c2449b7c28669de/mozglue/linker/CustomElf.cpp#141

This is not making any sense.

I'm afraid the only way forward is to be able to reproduce it on a device and get more data. Who has access to a S8?
Reporter

Comment 15

Last year
I do. On the mobile team Randal Barker and Susheel Daswani also have this device. Dave Miller (Justdave) has an S8 as a primary device and says that he encounters the crash up to a few times a day. Though he is a sysadmin.
Kevin - can you provide the Samsung model number for the S8 you're seeing the issue on?
Flags: needinfo?(kbrosnan)
My S8 was extremely unstable last month, but then it became good again. Over the past two days it has been crashing again:
here are the last 2 crash reports:
https://crash-stats.mozilla.com/report/index/22f3397d-4732-400f-b545-3be460180522
https://crash-stats.mozilla.com/report/index/b562e850-9a86-49a9-9bee-293ae0180522
If anyone wants my S8 feel free to ask, I'll ask for a S9 replacement of course :)
Reporter

Comment 19

Last year
That info is in bug 1450793 comment 19.

jya your issue is a different bug. This only happens with North American models of the S8. Your phone has completely different hardware.
Flags: needinfo?(kbrosnan)
Are those logs from your device? If they are, in those logs, there are lines like "05-22 15:37:36.620 1074 1074 E /system/bin/tombstoned: Tombstone written to: /data/tombstones//tombstone_00"

Could you attach those tombstore files?
Flags: needinfo?(kbrosnan)
Reporter

Comment 23

Last year
Getting permission denied trying to access the files. Looks like it requires root to access the tombstone files. From a quick search there are some options for rooting the Galaxy S8 w/ Android 8.0.
Flags: needinfo?(kbrosnan)
The crash from comment 0 is happening at [1]. We think mprotect() is lying about whether it succeeded or not.

[1] https://searchfox.org/mozilla-central/rev/dc6d85680539cb7e5fe2843d38a30a0581bfefe1/mozglue/linker/ElfLoader.cpp#1007
(In reply to Jim Chen [:jchen] [:darchons] from comment #24)
> The crash from comment 0 is happening at [1]. We think mprotect() is lying
> about whether it succeeded or not.
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> dc6d85680539cb7e5fe2843d38a30a0581bfefe1/mozglue/linker/ElfLoader.cpp#1007

That certainly makes more sense than comment 14, but how did you get that line number?
Assignee: nobody → nchen
(In reply to Mike Hommey [:glandium] from comment #25)
> (In reply to Jim Chen [:jchen] [:darchons] from comment #24)
> > The crash from comment 0 is happening at [1]. We think mprotect() is lying
> > about whether it succeeded or not.
> > 
> > [1]
> > https://searchfox.org/mozilla-central/rev/
> > dc6d85680539cb7e5fe2843d38a30a0581bfefe1/mozglue/linker/ElfLoader.cpp#1007
> 
> That certainly makes more sense than comment 14, but how did you get that
> line number?

I added the address (0x142ba) and file offset (0x3c000) from the crash dump plus an elfhack offset (0x1000) to get the virtual offset (0x512ba), and looked that up in the breakpad symbols.
Comment hidden (mozreview-request)
(In reply to Jim Chen [:jchen] [:darchons] from comment #26)
> (In reply to Mike Hommey [:glandium] from comment #25)
> > (In reply to Jim Chen [:jchen] [:darchons] from comment #24)
> > > The crash from comment 0 is happening at [1]. We think mprotect() is lying
> > > about whether it succeeded or not.
> > > 
> > > [1]
> > > https://searchfox.org/mozilla-central/rev/
> > > dc6d85680539cb7e5fe2843d38a30a0581bfefe1/mozglue/linker/ElfLoader.cpp#1007
> > 
> > That certainly makes more sense than comment 14, but how did you get that
> > line number?
> 
> I added the address (0x142ba) and file offset (0x3c000) from the crash dump
> plus an elfhack offset (0x1000) to get the virtual offset (0x512ba), and
> looked that up in the breakpad symbols.

Aaaaaah right, 0x3c000 is the *file* offset.

Comment 29

Last year
mozreview-review
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

https://reviewboard.mozilla.org/r/246834/#review253066

::: mozglue/linker/ElfLoader.cpp:914
(Diff revision 1)
>        return;
>      }
>  
>      page = firstPage;
>      int ret = mprotect(page, length, prot | PROT_WRITE);
> -    success = ret == 0;
> +    int newProt = getProt(start, &end);

Please skip this in the case where ret is not 0.

Also consider how much of a wtf this is (mprotect pretends there was no error while not doind what it's asked for), please add an explicit comment. We don't want someone looking at the code later and think "this looks useless, let's remove it".
Attachment #8980661 - Flags: review?(mh+mozilla)

Comment 30

Last year
mozreview-review
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

https://reviewboard.mozilla.org/r/246834/#review253070

::: mozglue/linker/ElfLoader.cpp:919
(Diff revision 1)
> -    success = ret == 0;
> +    int newProt = getProt(start, &end);
> +    success = (ret == 0) && (newProt != -1) && (newProt & PROT_WRITE);
>      if (!success) {
>        ERROR("mprotect(%p, %zu, %d) = %d (errno=%d; %s)",
>              page, length, prot | PROT_WRITE, ret,
>              errno, strerror(errno));

errno shouldn't be displayed when ret is 0.

I'd be interested in the output here. I wonder if the root cause is because of prot containing PROT_EXEC, and some OS-level protection forbidding executable pages to be turned writable, although that shouldn't be silent.
While you're there, please change the %d for prot to %o. I guess snorp missed the subtle difference in bug 1455662 comment 6.
Comment hidden (mozreview-request)

Comment 33

Last year
mozreview-review
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

https://reviewboard.mozilla.org/r/246834/#review253746

::: mozglue/linker/ElfLoader.cpp:927
(Diff revision 2)
> +    // success even after _failing_ to make the page writable. Therefore, check
> +    // for write access again instead of relying on the mprotect return value.
> +    int newProt = getProt(start, &end);
> +    success = (newProt != -1) && (newProt & PROT_WRITE);
>      if (!success) {
> -      ERROR("mprotect(%p, %zu, %d) = %d (errno=%d; %s)",
> +      WARN("mprotect(%p, %zu, %o) returned 0 but page is not writable: %o",

It should be ERROR like the other, or the other WARN like this one.
Attachment #8980661 - Flags: review?(mh+mozilla) → review+
Comment hidden (mozreview-request)

Comment 35

Last year
Pushed by nchen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a4465713555b
Check page protection flags again after mprotect(); r=glandium

Comment 36

Last year
bugherder
https://hg.mozilla.org/mozilla-central/rev/a4465713555b
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → Firefox 62
Please nominate this for Beta uplift so we can hopefully include it in Monday's b11 build.
Flags: needinfo?(nchen)
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

Approval Request Comment
[Feature/Bug causing the regression]: N/A
[User impact if declined]: Crash on some devices like Galaxy S8
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: No
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Not really
[Why is the change risky/not risky?]: The patch adds an additional protection against the crash, but doesn't otherwise affect behavior.
[String changes made/needed]: None
Flags: needinfo?(nchen)
Attachment #8980661 - Flags: approval-mozilla-beta?
Do we have the log of a failed madvise?
Flags: needinfo?(nchen)
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

More mprotect sanity checking to hopefully work around Galaxy S8 crashes. Approved for 61.0b11.
Flags: needinfo?(nchen)
Attachment #8980661 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(nchen)
(In reply to Mike Hommey [:glandium] from comment #39)
> Do we have the log of a failed madvise?

I haven't seen any. I think snorp has gone through more logs.
Flags: needinfo?(nchen) → needinfo?(snorp)
We are seeing an even bigger crash spike on the Play Store, again not reflected in crash-stats, for S8 devices. I am not sure how to pin down the cause, but if we see a similar spike on beta in a day or two then this might be the cause.    

Jim can you see those crash results? I can email you the link if you have the permissions.
Flags: needinfo?(nchen)
I don't think I have the permissions.
Flags: needinfo?(nchen)
(In reply to Jim Chen [:jchen] [:darchons] from comment #42)
> (In reply to Mike Hommey [:glandium] from comment #39)
> > Do we have the log of a failed madvise?
> 
> I haven't seen any. I think snorp has gone through more logs.

I also haven't seen any. Play Store doesn't give us logcat, and I don't think there are any failed mprotect() in kbrosnan's logs.
Flags: needinfo?(snorp)
Liz, can we get Jim access to the play store console? I think it may help if he can look through that directly.
Flags: needinfo?(lhenry)
Sylvestre, can you help with that since you have the permissions to add a new user?
Flags: needinfo?(lhenry) → needinfo?(sledru)
Sure, I sent an invite to nchen@mozilla.com

FYI, I have a Samsung S8 and with this bug, it is an unusable application.
Flags: needinfo?(sledru)
Reporter

Comment 49

Last year
I will bring this device to SF and maybe we can get this in front of a developer.
Backed out for increasing the crash rate as reported in the Play Console.

https://hg.mozilla.org/mozilla-central/rev/f7fd9b08c0156be5b5cd99de5ed0ed0b98d93051
Status: RESOLVED → REOPENED
Flags: needinfo?(nchen)
Resolution: FIXED → ---
Target Milestone: Firefox 62 → ---
Attachment #8980661 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
(In reply to Kevin Brosnan [:kbrosnan] from comment #49)
> I will bring this device to SF and maybe we can get this in front of a
> developer.

Kevin: Can you update this bug if you are able to do this?
Flags: needinfo?(kbrosnan)
Reporter

Comment 53

Last year
Yes I have the device.
Flags: needinfo?(kbrosnan)
Comment hidden (mozreview-request)
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

Basically the crash is caused by our linker racing with the system linker in changing the page protection flags. This patch fixes that by holding the system linker lock while changing the debug map.
Flags: needinfo?(nchen)
Attachment #8980661 - Flags: review+ → review?(mh+mozilla)
Attachment #8980661 - Flags: approval-mozilla-beta-

Comment 56

Last year
mozreview-review
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

https://reviewboard.mozilla.org/r/246834/#review257582

::: commit-message-d36cd:10
(Diff revision 4)
> +flags. To fix the race, we need to take the system linker's internal
> +lock when we perform any kind of modification on the debug map.
> +
> +One way to hold the system linker lock is to call dl_iterate_phdr, and
> +perform our actions inside the callback, which is invoked with the
> +locked being held. However, dl_iterate_phdr is only present on Android

s/locked/lock/

::: mozglue/linker/ElfLoader.cpp:321
(Diff revision 4)
> +/**
> + * Return the current Android version, or 0 on failure.
> + */
> +int
> +GetAndroidSDKVersion() {
> +  static int version;

maybe explicitly initialize to 0? (although that's what it is implicitly)

::: mozglue/linker/ElfLoader.cpp:632
(Diff revision 4)
>  {
>    Register(static_cast<LibHandle *>(handle));
>    if (dbg) {
> -    dbg.Add(handle);
> +    // We could race with the system linker when modifying the debug map, so
> +    // only do so while holding the system linker's internal lock.
> +    RunWithSystemLinkerLock([this, handle] { dbg.Add(handle); });

Ideally, the logging should be in DebuggerHelper::Add, but there is only this caller, and lambda'ing the whole function would not be much better, so I guess it's okay like this.

::: mozglue/linker/ElfLoader.cpp:661
(Diff revision 4)
>  {
>    Forget(static_cast<LibHandle *>(handle));
>    if (dbg) {
> -    dbg.Remove(handle);
> +    // We could race with the system linker when modifying the debug map, so
> +    // only do so while holding the system linker's internal lock.
> +    RunWithSystemLinkerLock([this, handle] { dbg.Remove(handle); });

Likewise
Attachment #8980661 - Flags: review?(mh+mozilla) → review+

Comment 57

Last year
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/a7384267f89c
Hold system linker lock while modifying debug map. r=glandium, a=RyanVM
I landed this with glandium's first two nits addressed. I've also triggered new Fennec nightlies so we can get crash data over the weekend. Jim, can you please request Beta approval on this when you get a chance? I'll wait to approve until after reviewing the Play Store crash data on Monday, but it would be great to get it into the b15 build that morning. Thanks!
Status: REOPENED → RESOLVED
Closed: Last yearLast year
Flags: needinfo?(nchen)
Resolution: --- → FIXED
Target Milestone: --- → Firefox 62
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

Approval Request Comment
[Feature/Bug causing the regression]: N/A
[User impact if declined]: Random crashes on Galaxy S8/S9
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Tested locally
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Slightly
[Why is the change risky/not risky?]: We are adding a new code path on every startup, but that is fairly trivial and should be very stable.
[String changes made/needed]: None
Flags: needinfo?(nchen)
Attachment #8980661 - Flags: approval-mozilla-beta?
Unfortunately, due to bug 1464034, we haven't been able to ship this fix to Nightly users yet :-\
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

I'm not going to be able to take this for today's b15 build due to the lack of crash data from over the weekend, so I'm moving the approval request to release to keep it on the radar for RC inclusion. We were able to manually upload a Nightly APK to the Play Store yesterday, so that at least gets us unblocked here.
Attachment #8980661 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Comment on attachment 8980661 [details]
Bug 1460989 - Hold system linker lock while modifying debug map;

The Google Play console is showing a big drop-off in crash rate on Nightly builds \m/. Approved for the Fennec 61 RC.
Attachment #8980661 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.