Open Bug 1764605 Opened 3 years ago Updated 2 years ago

Common Firebase ANR in all mobile project main activities - input dispatcher time out

Categories

(GeckoView :: IME, defect, P2)

All
Android
defect

Tracking

(firefox-esr91 wontfix, firefox99 wontfix, firefox100 wontfix, firefox101 wontfix, firefox102 wontfix)

Tracking Status
firefox-esr91 --- wontfix
firefox99 --- wontfix
firefox100 --- wontfix
firefox101 --- wontfix
firefox102 --- wontfix

People

(Reporter: aaronmt, Unassigned)

References

Details

(Whiteboard: [geckoview:m101] [geckoview:m102] [geckoview:m103] [geckoview:m104] )

Details in https://github.com/mozilla-mobile/fenix/issues/24047

We have a branch open that is preparing to upgrade the AndroidX Test and Support libraries and device on Firebase to target API 30 across mobile projects. It is on this branch where we are frequently (daily at this point through trial runs) running into these ANRs (input dispatcher timeout). We would like to land this work to upgrade our tests to run on API30, but we're hesitant with the introduced instability. These ANRs occur across all mobile projects.

Input dispatcher State at time of last ANR:
  ANR:
    Time: 2022-03-15 09:48:37
    Reason: e35e897 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity (server) is not responding. Waited 5003ms for FocusEvent(hasFocus=true)
    Window: ActivityRecord{f0b7ada u0 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity t9} - Window{e35e897 u0 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity}
  DispatchEnabled: true
  DispatchFrozen: false
  InputFilterEnabled: false
  FocusedDisplayId: 0
  FocusedApplications:
    displayId=0, name='ActivityRecord{f0b7ada u0 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity t9}', dispatchingTimeout=5000ms
  FocusedWindows:
    displayId=0, name='Window{e35e897 u0 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity}'
  TouchStates: <no displays touched>
  Display: 0
    Windows:
      0: name='Window{3c8501f u0 NavigationBar0}', displayId=0, portalToDisplayId=-1, paused=false, hasFocus=false, hasWallpaper=false, visible=true, canReceiveKeys=false, flags=0x21840068, type=0x000007e3, frame=[0,1788][1080,1920], globalScale=1.000000, windowScale=(1.000000,1.000000), touchableRegion=[0,1788][1080,1920], inputFeatures=0x00000000, ownerPid=1609, ownerUid=10133, dispatchingTimeout=5000ms
      1: name='Window{c2039a5 u0 StatusBar}', displayId=0, portalToDisplayId=-1, paused=false, hasFocus=false, hasWallpaper=false, visible=true, canReceiveKeys=false, flags=0x81800008, type=0x000007d0, frame=[0,0][1080,66], globalScale=1.000000, windowScale=(1.000000,1.000000), touchableRegion=[0,0][1080,66], inputFeatures=0x00000000, ownerPid=1609, ownerUid=10133, dispatchingTimeout=5000ms
      2: name='Window{e35e897 u0 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity}', displayId=0, portalToDisplayId=-1, paused=false, hasFocus=true, hasWallpaper=false, visible=true, canReceiveKeys=true, flags=0x81010120, type=0x00000001, frame=[0,0][1080,1920], globalScale=1.000000, windowScale=(1.000000,1.000000), touchableRegion=[0,0][1080,1920], inputFeatures=0x00000000, ownerPid=12403, ownerUid=10153, dispatchingTimeout=5000ms
      3: name='SurfaceView - org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity#0', displayId=0, portalToDisplayId=-1, paused=false, hasFocus=false, hasWallpaper=false, visible=true, canReceiveKeys=false, flags=0x00000020, type=0x00000000, frame=[0,66][1080,1788], globalScale=1.000000, windowScale=(1.000000,1.000000), touchableRegion=<empty>, inputFeatures=0x00000002, ownerPid=12403, ownerUid=10153, dispatchingTimeout=0ms
      4: name='Window{5a9eb6b u0 com.android.systemui.ImageWallpaper}', displayId=0, portalToDisplayId=-1, paused=false, hasFocus=false, hasWallpaper=false, visible=false, canReceiveKeys=false, flags=0x00014318, type=0x000007dd, frame=[-54,-96][2509,2016], globalScale=1.000000, windowScale=(0.484979,0.484848), touchableRegion=[-54,-96][4750,3864], inputFeatures=0x00000000, ownerPid=1609, ownerUid=10133, dispatchingTimeout=5000ms
  Global monitors in display 0:
    0: 'PointerEventDispatcher0 (server)', 
  RecentQueue: length=10
    KeyEvent(deviceId=-1, source=0x00000101, displayId=-1, action=UP, flags=0x00000000, keyCode=4, scanCode=0, metaState=0x00000000, repeatCount=0), policyFlags=0x6b000000, age=11463ms
    FocusEvent(hasFocus=false), age=11347ms
    FocusEvent(hasFocus=true), age=11347ms
    MotionEvent(deviceId=-1, source=0x00001002, displayId=0, action=DOWN, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=1.0, yPrecision=1.0, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (915.0, 1711.0)]), policyFlags=0x6b000000, age=10462ms
    MotionEvent(deviceId=-1, source=0x00001002, displayId=0, action=UP, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=1.0, yPrecision=1.0, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (915.0, 1711.0)]), policyFlags=0x6b000000, age=10462ms
    FocusEvent(hasFocus=false), age=10168ms
    FocusEvent(hasFocus=true), age=9000ms
    MotionEvent(deviceId=-1, source=0x00001002, displayId=0, action=DOWN, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=1.0, yPrecision=1.0, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (485.0, 991.0)]), policyFlags=0x6b000000, age=6064ms
    MotionEvent(deviceId=-1, source=0x00001002, displayId=0, action=UP, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=1.0, yPrecision=1.0, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (485.0, 991.0)]), policyFlags=0x6b000000, age=5847ms
    FocusEvent(hasFocus=true), age=5033ms
  PendingEvent: <none>
  InboundQueue: <empty>
  ReplacedKeys: <empty>
  Connections:
    346: channelName='e35e897 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity (server)', windowName='e35e897 org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity (server)', status=NORMAL, monitor=false, responsive=false
      OutboundQueue: <empty>
      WaitQueue: length=1
        FocusEvent(hasFocus=true), targetFlags=0x00000100, resolvedAction=0, age=5033ms, wait=5003ms
    317: channelName='dd0c025 com.google.android.apps.mtaas.backdrop/com.google.android.apps.mtaas.backdrop.BackdropActivity (server)', windowName='dd0c025 com.google.android.apps.mtaas.backdrop/com.google.android.apps.mtaas.backdrop.BackdropActivity (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    262: channelName='484a672 AssistPreviewPanel (server)', windowName='484a672 AssistPreviewPanel (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    331: channelName='fbd63e0 com.android.launcher3/com.android.searchlauncher.SearchLauncher (server)', windowName='fbd63e0 com.android.launcher3/com.android.searchlauncher.SearchLauncher (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    302: channelName='recents_animation_input_consumer (server)', windowName='recents_animation_input_consumer (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    335: channelName='3e98489 InputMethod (server)', windowName='3e98489 InputMethod (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    311: channelName='f2aeb25 pip-dismiss-overlay (server)', windowName='f2aeb25 pip-dismiss-overlay (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    267: channelName='c2039a5 StatusBar (server)', windowName='c2039a5 StatusBar (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    310: channelName='5a9eb6b com.android.systemui.ImageWallpaper (server)', windowName='5a9eb6b com.android.systemui.ImageWallpaper (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    343: channelName='c56a870 NotificationShade (server)', windowName='c56a870 NotificationShade (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    237: channelName='24b5f2c com.android.launcher3/com.android.searchlauncher.SearchLauncher (server)', windowName='24b5f2c com.android.launcher3/com.android.searchlauncher.SearchLauncher (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    281: channelName='3c8501f NavigationBar0 (server)', windowName='3c8501f NavigationBar0 (server)', status=NORMAL, monitor=false, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
    140: channelName='PointerEventDispatcher0 (server)', windowName='PointerEventDispatcher0 (server)', status=NORMAL, monitor=true, responsive=true
      OutboundQueue: <empty>
      WaitQueue: <empty>
  AppSwitch: not pending
  Configuration:
    KeyRepeatDelay: 50ms
    KeyRepeatTimeout: 400ms

https://github.com/mozilla-mobile/fenix/issues/24047#issuecomment-1069667056

From Mike:

The ANR seems to be caused by the main thread waiting for a art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148) during on a vsync:

"main" prio=5 tid=1 Native

  #00  pc 0000000000086b4c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  #00  pc 00000000001af92c  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148)
  #00  pc 00000000003a0408  /apex/com.android.art/lib64/libart.so (art::JNI<false>::CallObjectMethodV(_JNIEnv*, _jobject*, _jmethodID*, std::__va_list)+504)
  #00  pc 0000000000004040  /apex/com.android.art/lib64/libnativehelper.so (_JNIEnv::CallObjectMethod(_jobject*, _jmethodID*, ...)+124)
  #00  pc 00000000000ed10c  /system/lib64/libandroid_runtime.so (android::NativeDisplayEventReceiver::dispatchVsync(long, unsigned long, unsigned int)+48)
  #00  pc 000000000008fae8  /system/lib64/libgui.so (android::DisplayEventDispatcher::handleEvent(int, int, void*)+168)
  #00  pc 0000000000019da8  /system/lib64/libutils.so (android::Looper::pollInner(int)+916)
  #00  pc 00000000000199ac  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+112)
  #00  pc 0000000000118288  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  at android.os.MessageQueue.nativePollOnce (Native method)
  at android.os.MessageQueue.next (MessageQueue.java:335)
  at android.os.Looper.loop (Looper.java:193)
  at android.app.ActivityThread.main (ActivityThread.java:8059)
  at java.lang.reflect.Method.invoke (Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run (RuntimeInit.java:656)
  at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:967)

Adding this bug to GV's m101 sprint because Aaron said this crash is common in Google Play's crash stats.

Tentatively assigning to Agi because he said he would investigate the crash this week.

Assignee: nobody → agi
Severity: -- → S2
Priority: -- → P1
Whiteboard: [geckoview:m101]
Assignee: agi → jonalmeida942

Agi says this is a GV bug but the Graphics team is also looking at it. He recommends we check with them.

Whiteboard: [geckoview:m101] → [geckoview:m101] [geckoview:m102]

noted in android eng-call happy to run any signed build with any sort of additional GV logging to LOGCAT on Firebase Test Lab against the tests

No sure, If I managed to reproduce this manually on a Pixel 3a (Android 11) but I've encountered a crash #25102 on today's Nightly build that has a similar STR with two automated UI tests affected by the ANR's #24978

Christian you had mentioned builds with logging that we can run on Firebase Test Lab as an option to help facilitate this bug, are those available?

Flags: needinfo?(csadilek)

(In reply to Aaron Train [:aaronmt] from comment #6)

Christian you had mentioned builds with logging that we can run on Firebase Test Lab as an option to help facilitate this bug, are those available?

Aaron, I'm looking into it this week, so I'll need a bit more time to get you a build.

Flags: needinfo?(csadilek)

I've built a debug GV locally, packaged it within a Fenix build that was signed with my debug keys, and placed a few variants here: https://drive.google.com/drive/folders/10ZzQ7dK1gewv8hAdllaJFTs2hEKE3-gN?usp=sharing

Could you let me know if you have trouble installing them in case the debug key signing process doesn't work on Firebase?

Flags: needinfo?(aaron.train)

So here are some updates on what Christian and I discovered last week:

  1. We can see these ANRs happening across all our browsers (fenix/focus/reference browser) which means that there is no application-specific code that could be the cause.
  2. When we looked at a failing firebase video, you can see the page loading, some seconds go by and app closes (the video is shortened so it looks quicker than it is).
  3. There is a bugreport.txt that has some trace files that happened during the ANR, but those do not show anything happening on the main thread that looks suspicious; no references to our code running at the point of the ANR.
  4. The dump of the graphics stats for our app shows: Total frames rendered: 25 Janky frames: 24 (96.00%) So the app is laggy.
  5. Could this be the app or is the virtual emulator too slow for our debug app?
  6. When you look at the logcat, we can see GeckoView logs show the test page loading, completing, and idling; less reason to suspect GV at this point.
    • The logs are also very generic so the Google Play reports that indicate the same crash might be the cause of something else.
  7. There is a suspicious line in the Firebase report: Test instrumentation process crashed. Check org.mozilla.focus.activity.ErrorPagesTest#badURLCheckTest.txt for details
    • We can't get these logs off Firebase for this particular text file but there is a request for the Firebase team to do this eventually.
  8. Some google searches indicate that people have had these instrumentation process crashes happen because of mismatched dependencies we found an alpha version of a test dep that we should update to a stable as well.

Some next steps to try:

  1. There is some reason to suspect that if we can upgrade to a matching version of test frameworks needed, these ANRs may go away.
  2. Try a faster emulator or maybe just a different one entirely.
  3. Profile the app while closing tabs to see if we're causing some kind of memory/CPU spike which is slowing the emulator.
  4. Try and deduce the location in code where this is happening and print the stack traces out on a loop to see what the main thread is doing.
  5. If all else fails, we can provide a debug GV within a debug app to aaronmt and have him reproduce the bug on Firebase again to see if that bugreport gives us more information.

Clearing NI for now until I investigate what would be the right dependencies to match up before retrying.

Flags: needinfo?(aaron.train)

Jonathan will continue investigation in 103. If this is a bug in the test or test framework, we don't need to uplift to Beta 102.

Whiteboard: [geckoview:m101] [geckoview:m102] → [geckoview:m101] [geckoview:m102] [geckoview:m103]

Jon says this is a Fenix issue, not a GV issue. He'll reopen the Fenix bug and close this one.

Whiteboard: [geckoview:m101] [geckoview:m102] [geckoview:m103] → [geckoview:m101] [geckoview:m102] [geckoview:m103] [geckoview:m104]

Moving some input bugs to the new GeckoView::IME component.

Component: General → IME

105

Whiteboard: [geckoview:m101] [geckoview:m102] [geckoview:m103] [geckoview:m104] → [geckoview:m101] [geckoview:m102] [geckoview:m103] [geckoview:m104] [geckoview:m105]

Not actionable atm. Aaron is reaching out to Firebase.

Priority: P1 → P2
Whiteboard: [geckoview:m101] [geckoview:m102] [geckoview:m103] [geckoview:m104] [geckoview:m105] → [geckoview:m101] [geckoview:m102] [geckoview:m103] [geckoview:m104]

Ruling out GMSCore (FLOSS Play Services) updates on FTL as a suspect. FTL is currently rolling out GMSCore updates across devices (you can tell from the first frame in videos as the Play Services version), and I have an instance of the same (or similar signature) ANR

08-13 20:11:44.907: I/WindowManager(483): Input event dispatching timed out sending to org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity. Reason: 240a4be org.mozilla.fenix.debug/org.mozilla.fenix.HomeActivity (server) is not responding. Waited 5003ms for MotionEvent(deviceId=-1, source=0x00001002, displayId=0, action=MOVE, actionButton=0x00000000, flags=0x00000000, metaState=0x00000000, buttonState=0x00000000, classification=NONE, edgeFlags=0x00000000, xPrecision=1.0, yPrecision=1.0, xCursorPosition=nan, yCursorPosition=nan, pointers=[0: (915.0, 1711.0)]), policyFlags=0x6b000000 @ matrix summary details

With the next two lines as

08-13 20:11:45.022: I/ActivityManager(483): Force stopping org.mozilla.fenix.debug appid=10152 user=0: finished inst 08-13 20:11:45.022: I/ActivityManager(483): Killing 15886:org.mozilla.fenix.debug/u0a152 (adj 0): stop org.mozilla.fenix.debug due to finished inst

Would also like to rule out any warnings of Non-SDK API Usage Violations that FTL identifies (see matrix link violations here) (although no mention of Landroid/view/MotionEvent)

We have FocusEvent and MotionEvent instances at this point

Of note are the slow UI thread and (~100%) slow draw commands reported in the performance tab for this matrix

This issue is logged on Google's bug tracker here: https://issuetracker.google.com/issues/230950647

Un-assigning since there is nothing actionable for us to do here and an upstream bug has been filed.

Assignee: jonalmeida942 → nobody

This comment looks interesting https://issuetracker.google.com/issues/230950647#comment137

My understanding of Google's position on this issue, based on https://issuetracker.google.com/issues/230950647#comment51 and https://issuetracker.google.com/issues/230950647#comment53 is that there isn't an Android bug that's causing these types of ANRs, it's just that they've made a change which means that most ANRs are now being reporting with a callstack that tells you no useful information.

If so, it's not really the case that Google can go and fix these issues for you*, but that they've just created a situation where it's nigh on impossible to diagnose most ANRs. I think the best we can do as developers is look at the few actionable ANRs and hope that by fixing those it might also address some of the ones that come through with this useless callstack.

You need to log in before you can comment on or make changes to this bug.