Closed Bug 1799996 Opened 2 years ago Closed 1 year ago

org.webrtc.SurfaceEglRenderer crash in [@ android.view.WindowManager$BadTokenException: at android.view.ViewRootImpl.setView(ViewRootImpl.java)]

Categories

(Fenix :: Browser Engine, defect, P1)

All
Android

Tracking

(firefox106 wontfix, firefox107 wontfix, firefox108 verified, firefox109+ verified)

VERIFIED FIXED
109 Branch
Tracking Status
firefox106 --- wontfix
firefox107 --- wontfix
firefox108 --- verified
firefox109 + verified

People

(Reporter: cpeterson, Assigned: jonalmeida)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [geckoview:m109][geckoview:m110][fxdroid])

Crash Data

Attachments

(2 files)

+++ This bug was initially created as a clone of Bug #1536505 +++

[Tracking Requested - why for this release]:

This top crash is a regression in Fenix 106, probably related to WebRTC bug 1766646.

https://crash-stats.mozilla.org/report/index/ca003154-1b81-46f0-b1c2-0959c0221102

android.view.WindowManager$BadTokenException: Unable to add window -- token null is not valid; is your activity running?
	at android.view.ViewRootImpl.setView(ViewRootImpl.java:958)
	at android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java:381)
	at android.view.WindowManagerImpl.addView(WindowManagerImpl.java:100)
	at org.webrtc.SurfaceEglRenderer$$ExternalSyntheticLambda0.run(Unknown Source:15)
	at android.os.Handler.handleCallback(Handler.java:789)
	at android.os.Handler.dispatchMessage(Handler.java:98)
	at android.os.Looper.loop(Looper.java:164)
	at android.app.ActivityThread.main(ActivityThread.java:6944)
	at java.lang.reflect.Method.invoke(Native Method)
	at com.android.internal.os.Zygote$MethodAndArgsCaller.run(Zygote.java:327)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1374)

Some notes from bug 1536505 comment 23 and 24:

the 106 release included a libwebrtc update which brought in over a year's worth of improvements authored by Google. Some of this work was performance related. So it seems reasonable to guess here that if we have a race related to prompt display early in startup in geckoview, improvements to the lib would tickle that issue more often.

The only possibility we can come up with here is some sort of camera permissions prompt issue on a restart of the browser. (Uptimes are mostly 0 seconds.) Testing that scenario however indicates the prompts work correctly. We don't have urls in these reports, so we don't know what service they might be associated with.

I think it would be useful to get the geckoview teams thoughts on the error here - android.view.WindowManager$BadTokenException: Unable to add window -- token null is not valid; is your activity running? at android.view.ViewRootImpl.setView(ViewRootImpl.java:1002)

What might cause a null window token early in startup on a prompt?

Severity: -- → S2
Priority: -- → P2
Component: General → Browser Engine
Hardware: ARM → All
Whiteboard: [geckoview:m109?]

Set release status flags based on info from the regressing bug 1766646

:mjf, since you are the author of the regressor, bug 1766646, could you take a look?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mfroman)

Clearing the needinfo for mjf because we need an Android engineer to investigate first.

Flags: needinfo?(mfroman)

The bug is marked as tracked for firefox108 (nightly). We have limited time to fix this, the soft freeze is today. However, the bug still isn't assigned.

:cpeterson, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(cpeterson)

Tracking for 107, not new in the 107 release and 107 is in RC
Keeping it on the radar for a potential uplift and 107 dot release ride-along.

This crash is reported on all Android OS versions that Fenix supports (API 21-33), so it's not related to recent Google changes to Android permissions.

Flags: needinfo?(cpeterson)
Priority: P2 → P1
Whiteboard: [geckoview:m109?] → [geckoview:m109]

Set release status flags based on info from the regressing bug 1766646

Chris, has an android engineer investigated this?

I talked with Bryan Clark the engineering manager for the foundation team. Overall the crash rate for 106 was not adversely affected by this crash, this does not appear to be critical enough to interrupt sprint plans. A member of his team will look at this with a time box approach. If they are not able to provide a fix we will evaluate this for the 110 development cycle. Getting community STR would be another thing that could simplify finding a fix.

This is the #4 overall topcrash for Fenix 107.1.0. IMO this needs higher priority.

I don't understand why this was split out from bug 1536505. Web Conferencing Team feels this is caused by some sort of race in prompts vs. something that changed in webrtc.

Whiteboard: [geckoview:m109] → [geckoview:m109] [fxdroid]

(In reply to Jim Mathies [:jimm] from comment #10)

I don't understand why this was split out from bug 1536505. Web Conferencing Team feels this is caused by some sort of race in prompts vs. something that changed in webrtc.

That bug was an old, low-volume crash signature. Kevin recommended that we spin off the investigation into this specific crash spike into a new bug. If the fix for this crash spike fixes all the crashes with this signature, then we can close that bug, too.

This bug is in the Android Foundation team's current sprint.

Assigning to investigate.

Assignee: nobody → jonalmeida942

Removing from 107 tracking. Sounds like this may take some time to diagnose.

[Tracking Requested - why for this release]:
moving tracking forward.

Any progress here?

Flags: needinfo?(jonalmeida942)

(In reply to Jim Mathies [:jimm] from comment #15)

Any progress here?

Nothing conclusive so far.

The breadcrumbs indicate that we are trying to attach a view after it has been deatached (onDetach). From looking at the stack traces, I see other reports that show a trace from different entry points:

Link: https://crash-stats.mozilla.org/report/index/89ea700f-7272-4c2e-ac44-b88270221128

android.view.WindowManager$BadTokenException: Unable to add window -- token android.os.BinderProxy@fdb47ef is not valid; is your activity running?
	at android.view.ViewRootImpl.setView(ViewRootImpl.java:1147)
	at android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java:471)
	at android.view.WindowManagerImpl.addView(WindowManagerImpl.java:95)
	at android.app.Dialog.show(Dialog.java:507)
	at androidx.fragment.app.DialogFragment.onStart(DialogFragment.java:11)
	at org.mozilla.fenix.search.SearchDialogFragment.onStart(SearchDialogFragment.kt:1)
	at androidx.fragment.app.FragmentStateManager.start(FragmentStateManager.java:32)
	at androidx.fragment.app.FragmentStateManager.moveToExpectedState(FragmentStateManager.java:55)
	at androidx.fragment.app.FragmentManager.executeOpsTogether(FragmentManager.java:1178)
	at androidx.fragment.app.FragmentManager.removeRedundantOperationsAndExecute(FragmentManager.java:92)
	at androidx.fragment.app.FragmentManager.execPendingActions(FragmentManager.java:74)
	at androidx.fragment.app.FragmentManager$5.run(FragmentManager.java:4)
	at android.os.Handler.handleCallback(Handler.java:883)
	at android.os.Handler.dispatchMessage(Handler.java:100)
	at android.os.Looper.loop(Looper.java:237)
	at android.app.ActivityThread.main(ActivityThread.java:8167)
	at java.lang.reflect.Method.invoke(Native Method)
	at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:496)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1100)

Link: https://crash-stats.mozilla.org/report/index/936df649-b64f-4af9-afaa-cecd70221128

android.view.WindowManager$BadTokenException: Unable to add window -- token null is not valid; is your activity running?
	at android.view.ViewRootImpl.setView(ViewRootImpl.java:598)
	at android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java:341)
	at android.view.WindowManagerImpl.addView(WindowManagerImpl.java:85)
	at org.mozilla.fenix.compose.cfr.CFRPopup$$ExternalSyntheticLambda0.run(R8$$SyntheticClass:56)
	at android.os.Handler.handleCallback(Handler.java:743)
	at android.os.Handler.dispatchMessage(Handler.java:95)
	at android.os.Looper.loop(Looper.java:150)
	at android.app.ActivityThread.main(ActivityThread.java:5546)
	at java.lang.reflect.Method.invoke(Native Method)
	at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:792)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:682)

This leads me to believe that it's more a timing issue with how we decide to add a view in some general sense and not specific to the org.webrtc.SurfaceEglRenderer.

I'm now trying to see what might have landed in Fenix that might be trying to show Android Views after our Fragment was detached.

Flags: needinfo?(jonalmeida942)

I looked into the history apply_observation breadcrumbs, however they don't seem to be related - history events are being recorded here while browsing.

In some of the traces, the signature is from org.mozilla.fenix.compose.cfr.CFRPopup which is the non-upstreamed version. The traces seems to be happening specifically at BrowserToolbarCFRPresenter here.

This change was added in fenix#26226 which landed in 106 as well, so looks highly suspicious as the source of where the increase happens.

Current hypothesis: What I believe is happening, is that we are observing page loading until 99% completion until we decide to show a CFR popup here - this is happening off the main thread. When we get to the point to show this to the user here, the method CFRPopup#show then posts to the UI handler at the back of the queue on the main thread to add the popup view to the anchor view on the WindowManager. So there are two chained async tasks that are happening when the anchor view may have already been removed from the window before we actually to get to our addView call (which triggers the ViewRootImpl.setView).

What I'm trying to do now is reproduce this bug and find a solution for it.

A speculative fix has been put up for review. When it lands, we will observe the crash rates in nightly before considering an uplift.

(In reply to Jonathan Almeida [:jonalmeida] from comment #19)

Created attachment 9306798 [details] [review]
Bug 1799996 - Speculative fix checks CFRPopup.anchor is attached to window #28063

A speculative fix has been put up for review. When it lands, we will observe the crash rates in nightly before considering an uplift.

Landed in https://github.com/mozilla-mobile/fenix/commit/7a8b0a91c0657340c6a5d191099c22644c49c80a

Whiteboard: [geckoview:m109] [fxdroid] → [geckoview:m109][geckoview:m110][fxdroid]

No crashes on Nightly since this landed when we were previously seeing pretty consistent crash volume in every build. Looks like the fix worked! BTW, the fix does cherry-pick cleanly to v108 as well. IMO, we should definitely keep this on the radar for eventual dot release inclusion should the opportunity arise.

Status: NEW → RESOLVED
Closed: 1 year ago
No longer regressed by: libwebrtc-fast-forward
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

(In reply to Ryan VanderMeulen [:RyanVM] from comment #21)

No crashes on Nightly since this landed when we were previously seeing pretty consistent crash volume in every build. Looks like the fix worked! BTW, the fix does cherry-pick cleanly to v108 as well. IMO, we should definitely keep this on the radar for eventual dot release inclusion should the opportunity arise.

Thanks for checking - I can prep this for the 108 release branch in that case.

Uplifted, but the fix has not been released yet (no scheduled patch releases): https://github.com/mozilla-mobile/fenix/commit/c84d0805ec7d149c7f8fc7b7ee5a322ba685b976

We couldn't reproduce this issue on both Focus and Fenix with the latest builds ( RC108.1.1 build2 and 109.0b3 ).

Devices used:

  • Huawei P40 Lite (Android 10).
  • Lenovo Tab M10 (Android 10).
  • Oppo Find X5 (Android 12).
  • Samsung Galaxy S22 Ultra (Android 13).
  • Google Pixel 6 (Android 13).

Marking this ticket as verified.

Status: RESOLVED → VERIFIED
See Also: → 1824648
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: