Closed Bug 1382335 Opened 2 years ago Closed 2 years ago

Crash in java.lang.NullPointerException: Null native pointer at org.mozilla.gecko.mozglue.GeckoLoader.nativeRun(Native Method), mainly about nsWindow::LayerViewSupport

Categories

(Firefox for Android :: General, defect, critical)

Unspecified
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox55 --- wontfix
firefox56 --- fixed
firefox57 --- fixed

People

(Reporter: jchen, Assigned: jchen)

References

Details

(Keywords: crash)

Crash Data

Attachments

(5 files)

+++ This bug was initially created as a clone of Bug #1372777 +++

Taking a closer look at the crash and see if there is an alternate fix.
Right now we throw NullPointerException with the message "Null native
pointer", but we can save the type of the native pointer in the message
to make it easier to categorize and analyze crashes. This patch uses the
__func__ feature to get the name of a particular template parameter
type.
Attachment #8888539 - Flags: review?(esawin)
Introduce the crash again to get more crash reports for analysis.
Attachment #8888540 - Flags: review?(esawin)
Attachment #8888539 - Flags: review?(esawin) → review+
Comment on attachment 8888540 [details] [diff] [review]
2. Revert workaround from bug 1372777 (v1)

Review of attachment 8888540 [details] [diff] [review]:
-----------------------------------------------------------------

If I'm reading it correctly, your first patch will produce more useful crash signatures, but otherwise the same crash stacks.
We have enough crash stacks for analysis for bug 1372777, I'm not sure how we would benefit from re-enabling the top crasher.
The fix in bug 1372777 didn't actually get rid of the crash; it just moved the crash to a different signature. That's why in bug 1380134, we started seeing the same crash in mozilla::widget::GeckoEditableSupport, but with this patch it moved back to mozilla::jni::detail::ProxyNativeCall. I want to revert bug 1380134 to get more crash reports with the original (and correct) stack.
Attachment #8888540 - Flags: review?(esawin) → review+
There's some racing going on between compositor methods that use the
XPCOM queue and the disposeNative method that uses the priority queue.
Move everything to the XPCOM queue to fix this condition.
Attachment #8889976 - Flags: review?(esawin)
Attachment #8889976 - Flags: review?(esawin) → review+
Pushed by nchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3a434c64cc3b
1. Record the type of null native pointers; r=esawin
https://hg.mozilla.org/integration/mozilla-inbound/rev/81735bcac861
2. Revert workaround from bug 1372777; r=esawin
https://hg.mozilla.org/integration/mozilla-inbound/rev/094fa02f34a3
3. Use XPCOM queue for disposing compositor native objects; r=esawin
Check for shutdown so we don't crash on accessing a null window pointer.
Attachment #8893517 - Flags: review?(esawin)
Attachment #8893517 - Flags: review?(esawin) → review+
Pushed by nchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c1560f6fb5da
4. Check for shutdown in more LayerViewSupport functions; r=esawin
I had autophone-4 running a test of the latest version of the patch from Bug 1371291 where it now waits for runGecko.

<https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&filter-searchStr=autophone&exclusion_profile=false&fromchange=638bd646ed37daad51055480be194f00e471dd03&tochange=c14969b9b65bf1d497e3445161f52414bd4c1672>

You can see that before this landing at Fri Aug 4, 17:22:49 <https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=c1560f6fb5dae475b33373fdf4c2e60329e29f16&filter-searchStr=autophone> we had very consistent crashes which were greatly reduced though not eliminated afterwards.

The signatures we are still getting are:

unknown top frame] (Known intermittent)

[@ libc.so + 0x28a74] (Known intermittent)

[@ libc.so + 0x49da0]

[@ libdvm.so + 0x4bb80]

[@ nsWindow::LayerViewSupport::SyncResumeResizeCompositor(mozilla::jni::LocalRef<mozilla::java::LayerView::Compositor> const&, int, int, mozilla::jni::Ref<mozilla::jni::Object, _jobject*> const&)::OnResumedEvent::Run() + 0x3b]

[@ MessageLoop::PostTask_Helper] (Known intermittent)

It seems that there is still a slight problem with shutdown and LayerView. This happens on both android versions I have available on that host: Android 4.4 and 7.1.

https://treeherder.allizom.org/logviewer.html#?job_id=114441482&repo=mozilla-inbound&lineNumber=8082
https://treeherder.allizom.org/logviewer.html#?job_id=114441482&repo=mozilla-inbound&lineNumber=8081

You can see the home screen is displayed, the shutdown intent is sent and it takes about 8 seconds for the app to shut down where it crashes but it does shut down before Autophone has the need to force stop it.

I am going to run another experiment where I will wait for GeckoTabs(.*): zerdatime .* - page load start before issuing the shutdown intent.
Changing to detecting page load start didn't really help.
Simply adding the check for mWindow to SyncResumeResizeCompositor didn't help: https://treeherder.allizom.org/#/jobs?repo=try&revision=4e23c71995e5bdacc08e349a4902082d4903ec6d&exclusion_profile=false

Jim: Any thoughts?
Flags: needinfo?(nchen)
Blocks: 1371291
It wasn't enough to just check for `mWindow` in `OnResumedCompositor()`,
because the `LayerViewSupport` instance itself could have been released.
This patch now checks for both cases.
Attachment #8895434 - Flags: review?(esawin)
Part 5 should fix that one last crash.
Flags: needinfo?(nchen)
Attachment #8895434 - Flags: review?(esawin) → review+
Jim, can we go ahead and land this so I can test whether it does indeed resolve the problems?
Flags: needinfo?(nchen)
Pushed by nchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca55311e70a9
5. Check for shutdown in LayerViewSupport::OnResumedEvent; r=esawin
Just landed. Thanks for waiting.
Flags: needinfo?(nchen)
We still have a too high number of crashes but it is not clear they are the same issue as this bug.

https://treeherder.allizom.org/logviewer.html#?job_id=116486541&repo=mozilla-inbound&lineNumber=23057
PROCESS-CRASH | autophone-s1s2 | application crashed [@ libc.so + 0x49da0]
...
 8  libxul.so!abort_message [abort_message.cpp : 74 + 0xe]
 9  libxul.so!__cxa_pure_virtual [cxa_virtual.cpp : 21 + 0x12]
10  libxul.so!readEncodedPointer [cxa_personality.cpp : 292 + 0x6]
11  libxul.so!mozilla::layers::LayerManagerComposite::RenderToPresentationSurface [LayerManagerComposite.cpp:59c9e4c92782 : 1071 + 0x3]
...

Unfortunately I didn't capture the abort message.

https://treeherder.allizom.org/logviewer.html#?job_id=116415508&repo=mozilla-inbound&lineNumber=13027
PROCESS-CRASH | autophone-s1s2 | application crashed [@ libdvm.so + 0x4bb80]
...
 5  libxul.so!mozilla::gl::CreateSurfaceFromNativeWindow [GLContextProviderEGL.cpp:eb73e2d20c19 : 139 + 0x5]
 6  libxul.so!mozilla::gl::GLContextEGL::RenewSurface [GLContextProviderEGL.cpp:eb73e2d20c19 : 405 + 0x7]
 7  libxul.so!mozilla::layers::CompositorBridgeParent::ResumeComposition [CompositorBridgeParent.cpp:eb73e2d20c19 : 733 + 0x5]
 8  libxul.so!mozilla::layers::UiCompositorControllerParent::RecvResumeAndResize [UiCompositorControllerParent.cpp:eb73e2d20c19 : 74 + 0x7]
...

I have a number of pending jobs for mozilla-inbound, autoland and mozilla-central where I can continue to test this with the new non-quitter version of autophone. I'll see about making sure I get a logcat dump after a crash to see if we can find the abort message. I will also try to add a delay to see if that has any affect though it would not be my first choice.
Those crashes look unrelated to this bug. I think Randall is working on them in another bug?
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Please request Beta approval on the applicable patches when you get a chance.
Flags: needinfo?(nchen)
Keywords: leave-open
Comment on attachment 8893517 [details] [diff] [review]
4. Check for shutdown in more LayerViewSupport functions (v1)

Uplift request for part 4 (attachment 8893517 [details] [diff] [review]) and part 5 (attachment 8895434 [details] [diff] [review])

Approval Request Comment
[Feature/Bug causing the regression]: N/A
[User impact if declined]: Possible crashes during shutdown
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: No
[Is the change risky?]: No
[Why is the change risky/not risky?]: Patch only deals with situations where we used to crash on shutdown.
[String changes made/needed]: None
Flags: needinfo?(nchen)
Attachment #8893517 - Flags: approval-mozilla-beta?
Comment on attachment 8893517 [details] [diff] [review]
4. Check for shutdown in more LayerViewSupport functions (v1)

Please uplift parts 4 and 5 to beta. This is a top crash on release 55 fennec, with only a few crashes on other branches.  
I'm not sure we'll be able to tell the difference in beta, but let's try this.
Attachment #8893517 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8895434 [details] [diff] [review]
5. Check for shutdown in LayerViewSupport::OnResumedEvent (v1)

[Triage Comment]
Attachment #8895434 - Flags: approval-mozilla-beta+
Jim, are you aiming any of this fix at 55 release? Thanks.
Flags: needinfo?(nchen)
Patch 3, 4, 5 should be okay for 55, but I think I prefer them staying on 56.
Flags: needinfo?(nchen)
Double checking, it looked like you only mean to uplift patches 4 and 5 to 56.  Should we also take part 3?
Flags: needinfo?(nchen)
Part 3 is already on 56 :)  Part 4 and 5 missed the 56 merge date.
Flags: needinfo?(nchen)
See Also: → 1449567
You need to log in before you can comment on or make changes to this bug.