Context.startForegroundService() did not then call Service.startForeground()

RESOLVED FIXED in Firefox 63

Status

()

defect
P1
normal
RESOLVED FIXED
6 months ago
3 months ago

People

(Reporter: lizzard, Assigned: petru)

Tracking

({regression})

Firefox 63
Firefox 65
All
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox63+ fixed, firefox64 fixed, firefox65 fixed)

Details

(Whiteboard: --do_not_change--[priority:high])

Attachments

(1 attachment)

This hang is spiking with 63 release with 2300 reports today affecting 1600 users.
It seems to be spread fairly evenly across different devices.

Updated

6 months ago
Whiteboard: --do_not_change--[priority:high]
Looks as if the top affected devices are all Samsung Galaxy devices:

Galaxy S8 (dreamlte) 	        
Galaxy S9+ (star2qltesq) 	
Galaxy S7 (herolte) 	        
Galaxy Note8 (greatqlte) 	
Galaxy S9 (starqltesq)
I forgot to add all the users were running Android 8.0.

Updated

6 months ago
OS: Unspecified → Android
Hardware: Unspecified → All
(Assignee)

Updated

6 months ago
Assignee: nobody → petru.lingurar
Status: NEW → ASSIGNED
(In reply to Petru-Mugurel Lingurar[:petru] from comment #3)
> I cannot find recent reports for this
> https://crash-stats.mozilla.com/topcrashers/
> ?product=FennecAndroid&version=63.0&days=3&_facets_size=50

It is possible that somehow a crash report isn't getting generated. Sometimes we see things in GPC that don't necessarily show up in crash stats.
Context.startForegroundService() did not then call Service.startForeground() also shows up as an issue in beta - just checked the Google Play Console. There is also another similar error above that one in beta:

Context.startForegroundService() did not then call Service.startForeground(): ServiceRecord{ef8fc2c u0 org.mozilla.firefox_beta/org.mozilla.gecko.CrashReporterService}
(Assignee)

Comment 6

6 months ago
As I understand this was flagged by Google Play right after the 63 launch.
But it doesn't appear in our crash reports (hint 1).

There are 6 cases in which we call `startForegroundService()` [1]
In 4 of those cases `startForeground()` immediately follows in Service's `onStartCommand()` so this should work nicely.

In two cases `startForegroundService()` is used to start CrashHandlerService which does not call `startForeground()` with a Notification [2].
We knew this and decided that it was better to omit it so to not bug the user and also because CrashHandlerService's work takes very little time.
Because we are not posting a notification for CrashHandlerService's background work, after ~5 seconds ActivityManager posts an ANR in Android's log with the message `Context.startForegroundService() did not then call Service.startForeground()`. Nothing gets through to the user though.

I think this might be the cause for this recent errors. They haven't appeared before the release and they don't appear in our crash stats because they are transparent to the user. But Google Play still logs them because they do appear in Android's log.

I saw that :snorp updated the crash reporting API in bug 1483329 and now `CrashHandlerService` calls `System.exit(0)` to shutdown the JVM before the ANR [3].
To not wait for 64 we could 
- bring only this small change down to 63
- actually use a user visible Notification. This would need it's own separate Notification Channel though and I think will need to be supported in the future (although there would be no need for the Notification in 64 with the updated crash reporting API).
Because of this I think we should go with just bringing down that small change: terminate the app immediately after `CrashHandlerService` finishes it's work and so avoid that ANR.

Please keep in mind that this would be a speculative fix as I do not have any reports/stacktrace/STR.


[1] https://dxr.mozilla.org/mozilla-central/search?q=startForegroundService&redirect=true
[2] https://hg.mozilla.org/releases/mozilla-release/file/84219fbf133c/mobile/android/geckoview/src/main/java/org/mozilla/gecko/CrashReporterService.java#l80
[3] https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/CrashHandlerService.java#21
(Assignee)

Comment 7

6 months ago
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #5)
> Context.startForegroundService() did not then call Service.startForeground()
> also shows up as an issue in beta - just checked the Google Play Console.
> There is also another similar error above that one in beta:
> 
> Context.startForegroundService() did not then call
> Service.startForeground(): ServiceRecord{ef8fc2c u0
> org.mozilla.firefox_beta/org.mozilla.gecko.CrashReporterService}

And here is the reference to to CrashReporterService as I suspected.

So I think there are so many of this errors because this would be posted in Android's log for each and every error our crash reporter catches. But they do not get through to the user.
Still, to keep GP's Console clean I think that before FF 64 which should avoid this situation we could go on the same route and terminate the app immediately after `CrashReporterService` finishes its work to avoid that ANR.
There is a report at the bottom of the page where this is called out, it starts with what is below, but there are also further stacks - if it is useful I can cut and paste and add as an attachment:

"main" tid=1 Native
"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7329c4d8 self=0xeb9f5000
  | sysTid=24993 nice=0 cgrp=default sched=0/0 handle=0xefe0c4bc
  | state=S schedstat=( 88368963 1142576 103 ) utm=5 stm=3 core=1 HZ=100
  | stack=0xff0be000-0xff0c0000 stackSize=8MB
  | held mutexes=
 
  #00  pc 000000000004a090  /system/lib/libc.so (__epoll_pwait+20)
 
  #01  pc 000000000001b77d  /system/lib/libc.so (epoll_pwait+60)
 
  #02  pc 000000000001b7ad  /system/lib/libc.so (epoll_wait+12)
 
  #03  pc 0000000000010343  /system/lib/libutils.so (_ZN7android6Looper9pollInnerEi+118)
 
  #04  pc 0000000000010235  /system/lib/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+32)
 
  #05  pc 00000000000e4ba1  /system/lib/libandroid_runtime.so (_ZN7android18NativeMessageQueue8pollOnceEP7_JNIEnvP8_jobjecti+24)
 
  #06  pc 00000000001a627d  /system/framework/arm/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+92)
 
  at android.os.MessageQueue.nativePollOnce (Native method)
 
  at android.os.MessageQueue.next (MessageQueue.java:325)
 
  at android.os.Looper.loop (Looper.java:142)
 
  at android.app.ActivityThread.main (ActivityThread.java:6944)
 
  at java.lang.reflect.Method.invoke (Native method)
 
  at com.android.internal.os.Zygote$MethodAndArgsCaller.run (Zygote.java:327)
 
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1374)
(Assignee)

Comment 9

6 months ago
This service is started as a foreground service.
But because it does little work it doesn't post a Notification to bug the user
unneededly.
Although all it's work is done in < 5 seconds, after them passing the
ActivityManager will post an ANR error to logcat because we aren't calling
Service.startForeground() and actually posting a Notification.
Although nothing gets through to the user this errors are registered in the
Google Play console.
To prevent them we terminate the currently running JVM immedately after
CrashReporterService's work is completed, which is under those ~5 seconds.

The call for this is put into onUnbind() because in my tests if that call is put
at the end of onHandleWork() the JVM would be shutdown before the JobScheduler
would have had the chance to unbind and so after some time the service would be
binded again, and onHandleWork would be executed again.
onUbind() is called immediately after onHandleWork() finishes, after JobScheduler
unbinds, so this seems the earliest possible moment to terminate.
This moved back into the top ANR spot on the GPC and we received another alert on 10-27.
(Assignee)

Comment 11

6 months ago
This ANR is logged in Google Play console for every crash caught by our crash reporter.
But it should not get through to the user.
(A simple test would be going to "about:crashparent" . This would crash the browser and display the CrashReporterActivity which allows to user to enter details about the crash. After ~5 seconds the ANR would be posted in logs, but this would not cause any other issue to the user)
(In reply to Petru-Mugurel Lingurar[:petru] from comment #11)
> This ANR is logged in Google Play console for every crash caught by our
> crash reporter.
> But it should not get through to the user.
> (A simple test would be going to "about:crashparent" . This would crash the
> browser and display the CrashReporterActivity which allows to user to enter
> details about the crash. After ~5 seconds the ANR would be posted in logs,
> but this would not cause any other issue to the user)

Is there a reason it primarily is happening on Android 8?
Randall, we would like to have this patch in a Fennec dot release in the days to come, can you give us a status update on the review of the patch? Thanks
Flags: needinfo?(rbarker)
Flags: needinfo?(rbarker)
(Assignee)

Comment 15

6 months ago
Comment on attachment 9020360 [details]
Bug 1501449 - Terminate the JVM immediately after CrashReporterService finishes to prevent ANR; r?sdaswani

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1467461

User impact if declined: Nothing should be visible to the user but the Google Play Console would be clogged with ANR reports.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: No

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): Low risk because it is a small change, tested locally.
It mimics what is already done on Beta 64 with no reported issues.

String changes made/needed:
Attachment #9020360 - Flags: approval-mozilla-release?
Marking this fixed for 64 and 65 from comment 6 -- it was fixed in bug 1483329. 
We still need to land the patch here on m-r.
Comment on attachment 9020360 [details]
Bug 1501449 - Terminate the JVM immediately after CrashReporterService finishes to prevent ANR; r?sdaswani

Minimal patch already on nightly and beta with no regression, approved for our next Fennec dot release thanks.

[Triage Comment]
Attachment #9020360 - Flags: approval-mozilla-release? → approval-mozilla-release+
Depends on: 1483329
Target Milestone: --- → Firefox 65
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED
(Reporter)

Updated

3 months ago
Priority: -- → P1
You need to log in before you can comment on or make changes to this bug.