Closed Bug 1081768 Opened 5 years ago Closed Last year

Crash with java.lang.NullPointerException @ org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525) when BrowserApp is immediately displaced as foreground activity

Categories

(Firefox for Android :: General, defect, blocker)

All
Android
defect
Not set
blocker

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox35 - affected
fennec + ---

People

(Reporter: capella, Unassigned)

References

Details

(Keywords: crash, regression, topcrash-android-armv7, Whiteboard: [has bitrotted patches])

Crash Data

Attachments

(2 files, 2 obsolete files)

After a fresh install if I wait while FF opens on the new "Welcome" dialog, I get a NPE crash ... 
https://pastebin.mozilla.org/6773505

Points basically to java.lang.NullPointerException @ org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)

If after a fresh install I simply browse really fracking quickly ( ;-) )as suggested by rnewman, all is well.

He suggests: |the firstrun experience is preventing onWindowFocusChanged from being called on BrowserApp, and so it's not performing the init that it needs to.|

And that: |the solution is probably to fix that lifecycle stuff to not rely on oWFC.|
My crash on idle on the welcome screen has been happening for a while now [1] see bug 1077858.

[1] [@ java.lang.NullPointerException: at org.mozilla.gecko.db.BrowserDB.getCount(BrowserDB.java)]
Yup, they're related -- see my analysis in Bug 1077858 Comment 1.
Blocks: firstrun
Severity: normal → blocker
tracking-fennec: --- → ?
Crash Signature: [@ java.lang.NullPointerException: at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)]
Keywords: crash
Hardware: ARM → All
See Also: → 1077858
Summary: Crash on fresh nightly install ... java.lang.NullPointerException ... → Crash with java.lang.NullPointerException @ org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)
Duplicate of this bug: 1078304
Regression from bug 888482 based on the comments in bug 1078304 which was duped to this.
Blocks: 888482
Flags: needinfo?(nchen)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #4)
> Regression from bug 888482 based on the comments in bug 1078304 which was
> duped to this.

I don't believe that to be the case; it's not about GeckoThread, it's about GeckoApp.onWindowFocusChanged not being called thanks to the first run experience.

If you eliminate the first run experience, the crash goes away.
No, they're the same:

E/GeckoLayerView( 5618): java.lang.NullPointerException
E/GeckoLayerView( 5618):        at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)
E/GeckoLayerView( 5618):        at dalvik.system.NativeStart.run(Native Method)
E/GeckoLayerView( 5618):        at dalvik.system.NativeStart.run(Native Method)

but my argument is this:

The NPE is here in registerCxxCompositor:

       LayerView layerView = GeckoAppShell.getLayerView();
       GLController controller = layerView.getGLController();   <<<<

that means GeckoAppShell doesn't have a LayerView yet.

GAS.sLayerView is set in setLayerView, which in our case is called by GeckoApp.initializeChrome, which is called from BrowserApp's inherited initializeChrome and GeckoApp.initialize.

GeckoApp.initialize is called only from GeckoApp.onWindowFocusChanged.

Ergo, this NPE is a direct consequence of onWindowFocusChanged not being called before LayerView.registerCxxCompositor.

My hypothesis -- which capella verified in Comment 0 by tapping through quickly -- is that oWFC isn't being called until the FRE is dismissed.

Yes, there's still a race between UI code and the Gecko thread calling via JNI, but that race is non-existent in practice.

There is only one real fix for this bug: don't do this kind of work (or the DB setting in Bug 1077858) in oWFC, because it's the wrong place in the lifecycle.
Hm, ok. Thanks for the detailed explanation.
#1 on 35
tracking-fennec: ? → +
I don't think we can set-aside a top-crash.
tracking-fennec: + → ?
Chenxia is going to work around this in Bug 1072831, which is tracking 34.

I'll take this and do some lifecycle analysis.
Assignee: nobody → rnewman
Status: NEW → ASSIGNED
tracking-fennec: ? → 35+
Depends on: 1072831
Looks like the problem is due to LayerView.registerCxxCompositor being called on the compositor thread, which does not wait for the UI to initialize on the main thread.

I think a quick bandaid would be to move the GeckoApp.intiializeChrome() call to GeckoApp.onCreate().

To actually avoid the race though, we should probably make GeckoAppShell.getLayerView() wait on the LayerView to become available.
Flags: needinfo?(nchen)
It's possible that the compositor thread's access of sLayerView occurs before it's been set by an application activity.

This patch allows for a certain amount of overlap, and lays the groundwork for different access patterns.

Note that this doesn't fix the bug in all cases -- if a user waits on the first run activity (and thus onWindowFocusChange isn't called) for more than 10 seconds + the amount of time it takes for the compositor thread to start, then they'll get an IllegalStateException.

The intention in this patch is not to paper over that issue.
This patch builds upon Part 1 by making registerCxxCompositor block forever (until interrupted) for the LayerView to become available.

This *does* paper over the user-visible bug.
Part 3 coming to address the lifecycle change. Might bump it to a different bug.
Comment on attachment 8508150 [details] [diff] [review]
Part 1: wait for LayerView to become available in registerCxxCompositor

I've verified that this builds and regular Fennec startups work fine.
Attachment #8508150 - Flags: review?(nchen)
Attachment #8508151 - Flags: review?(nchen)
Depends on: 1085591
Comment on attachment 8508150 [details] [diff] [review]
Part 1: wait for LayerView to become available in registerCxxCompositor

Review of attachment 8508150 [details] [diff] [review]:
-----------------------------------------------------------------

We should be careful about making getLayerView wait. setLayerView is called on the UI thread, and if something calls getLayerView on the UI thread before setLayerView, it's going to have a bad time. How about we have a separate method, say waitForLayerView, that specifically waits for the layer view to become available? In that case, only LayerView.registerCxxCompositor would call waitForLayerView.

::: mobile/android/base/GeckoAppShell.java
@@ -97,5 @@
>  import android.media.MediaScannerConnection.MediaScannerConnectionClient;
>  import android.net.ConnectivityManager;
>  import android.net.NetworkInfo;
>  import android.net.Uri;
> -import android.os.Bundle;

GeckoAppShell is using Bundle now after bug 1043457 landed.

@@ +313,5 @@
> +                    return;
> +                }
> +                sLayerView = lv;
> +            } finally {
> +                if (lv != null) {

I don't think lv can ever be null?

@@ +337,5 @@
> +                return sLayerView;
> +            }
> +
> +            try {
> +                sLayerViewMonitor.wait(LAYER_VIEW_AVAILABILITY_MSEC);

Note that the wait time is not guaranteed because of spurious wakeups.
Attachment #8508150 - Flags: review?(nchen) → feedback+
(In reply to Jim Chen [:jchen] from comment #17)
> In that case, only LayerView.registerCxxCompositor
> would call waitForLayerView.

Fair point.


> GeckoAppShell is using Bundle now after bug 1043457 landed.

Yeah, I got a conflict just now when re-applying.


> I don't think lv can ever be null?

Why not? It's the argument. It's *not* null now, but it could be.

The point of that check is just to make sure we don't notify a waiter for a null value.


> Note that the wait time is not guaranteed because of spurious wakeups.

Yup, see the comment that snuck into Part 2.
Adds a second code path to wait, but synchronizes the current getLayerView.
Attachment #8508301 - Flags: review?(nchen)
Attachment #8508150 - Attachment is obsolete: true
Attachment #8508151 - Attachment is obsolete: true
Attachment #8508151 - Flags: review?(nchen)
Attachment #8508301 - Flags: review?(nchen) → review+
Comment on attachment 8508303 [details] [diff] [review]
Part 2: make JNI calls to retrieve the layer block for longer. v2

Review of attachment 8508303 [details] [diff] [review]:
-----------------------------------------------------------------

When waiting indefinitely, we should check that | sLayerView != null | before returning, and wait again otherwise.

Actually, I'd rather not have a timeout at all (i.e. waitForLayerView always waits indefinitely). That should simplify things and you can put the .wait() call inside a loop that checks for | sLayerView != null |. Using a loop also guards against spurious wakeups, which are unpredictable. If waiting indefinitely causes a hang, something is definitely broken, and our hang detection is robust enough to let us know.

That's just my 2 cents; it's your call.
Attachment #8508303 - Flags: review?(nchen) → review+
(In reply to Jim Chen [:jchen] from comment #21)

> That's just my 2 cents; it's your call.

Thanks for the input. I'll do some testing with infinite looping, see if anything seems bothersome.
I have this staged to land with infinite looping. Note that it will hang on startup without Bug 1085591, so gotta wait for review for that before I pull the trigger.
This looks healthy on Try apart from reftests, which seem to be doing something silly:

[Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIDOMWindowUtils.layerManagerType]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: chrome://reftest/content/reftest.jsm :: BuildConditionSandbox :: line 608" data: no]
Adding the signature from bug 1078304 since it was duped here.
Crash Signature: [@ java.lang.NullPointerException: at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)] → [@ java.lang.NullPointerException: at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)] [@ mozilla::widget::android::GLController::CreateEGLSurfaceForCompositorWrapper()]
Is this fixed via bug 1072831?
It'll be worked around by that bug. It'll be solved as part of Bug 1085591.
interim, can confirm no more crashing on start pane idle on trunk 10/29
Gently morphing this.
tracking-fennec: 35+ → 36+
Summary: Crash with java.lang.NullPointerException @ org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525) → Crash with java.lang.NullPointerException @ org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525) when BrowserApp is immediately displaced as foreground activity
Blocks: 1081052
Blocks: 1104692
Blocks: 1104747
Let's reconsider this in triage. It's something I want to fix, but we worked around it for now, and I don't think I have time to do it in 36.
tracking-fennec: 36+ → ?
tracking-fennec: ? → +
[Tracking Requested - why for this release]: this is the #1 topcrash in Fennec 35 -- it's also a startup crash and is likely preventing people from using Beta (if not driving users away altogether).

Fennec 35b2: 441 / 2524 crashes
Fennec 35b1: 1454 / 9257 crashes
Combined: 1895 / 11781 (16%)

I think this needs to be prioritized if it's not already being worked on.
How is this being worked around?  The crash volume is lower from b1 to b2 so is that because of workaround or because we've lost people?
Flags: needinfo?(rnewman)
Flags: needinfo?(anthony.s.hughes)
(In reply to Lukas Blakk [:lsblakk] use ?needinfo from comment #33)
> How is this being worked around?  The crash volume is lower from b1 to b2 so
> is that because of workaround or because we've lost people?

When I had posted comment 32 we had more data on 35b1 because we had only just released 35b2.

Here is the current breakdown by product for the last 7 days:
> FennecAndroid 	35.0b2 	59.37%	1533 crashes
> FennecAndroid 	35.0b1 	26.22%	677 crashes
> FennecAndroid 	35.0b4 	8.29%	214 crashes
> FennecAndroid 	36.0a2 	3.56%	92 crashes
> FennecAndroid 	37.0a1 	2.13%	55 crashes
> FennecAndroid 	36.0a1 	0.23%	6 crashes
> FennecAndroid 	35.0a2 	0.19%	5 crashes

We currently sit at 62717 ADIs across all Fennec Beta 35 versions, up from 57572 on December 15th. Crash rate is still significant but we're probably not losing a lot of people because of it.
Flags: needinfo?(anthony.s.hughes)
The workaround is that we no longer provoke this bug from the first-run experience, so we're back to the background hum of crashes.

Remaining crashes are likely to be due to app switching/notifications/etc -- ordinary Android lifecycle stages that we don't handle correctly.

This bug will suddenly become high-priority again as soon as we want to launch an activity within a minute of launching Firefox.
Flags: needinfo?(rnewman)
Given that, not tracking for 35 - we're not likely to cram in a fix for this at this stage.  Since it's tracking-fennec ? it appears that this will stay on mobile team's radar and get prioritized later.
Assignee: rnewman → nobody
Status: ASSIGNED → NEW
Whiteboard: [has bitrotted patches]
Crash Signature: [@ java.lang.NullPointerException: at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)] [@ mozilla::widget::android::GLController::CreateEGLSurfaceForCompositorWrapper()] → [@ java.lang.NullPointerException: at org.mozilla.gecko.gfx.LayerView.registerCxxCompositor(LayerView.java:525)] [@ mozilla::widget::android::GLController::CreateEGLSurfaceForCompositorWrapper()] [@ java.lang.NullPointerException: at org.mozilla.gecko.g…
Closing because no crash reported since 12 weeks.
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.