Closed Bug 912238 Opened 6 years ago Closed 6 years ago

Intermittent Android PROCESS-CRASH | java-exception | android.os.RemoteException: Unknown binder error code. 0xfffffff7at android.os.BinderProxy.transact(Native Method)

Categories

(Firefox for Android :: General, defect)

ARM
Android
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 30
Tracking Status
firefox28 --- unaffected
firefox29 --- unaffected
firefox30 --- fixed
firefox-esr24 --- unaffected

People

(Reporter: RyanVM, Assigned: kats)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(1 file, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=27335968&tree=Mozilla-Inbound

Android 4.0 Panda mozilla-inbound opt test mochitest-4 on 2013-09-03 11:57:50 PDT for push 2255560d9b6c
slave: panda-0852

12:16:22     INFO -  38895 INFO TEST-START | Shutdown
12:16:22     INFO -  38896 INFO Passed:  38057
12:16:22     INFO -  38897 INFO Failed:  0
12:16:22     INFO -  38898 INFO Todo:    200
12:16:22     INFO -  38899 INFO Slowest: 13372ms - /tests/content/media/webaudio/test/test_audioBufferSourceNodeNeutered.html
12:16:22     INFO -  38900 INFO SimpleTest FINISHED
12:16:22     INFO -  INFO | automation.py | Application ran for: 0:06:11.412514
12:16:22     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpBNtulVpidlog
12:16:23     INFO -  /data/anr/traces.txt not found
12:16:23  WARNING -  PROCESS-CRASH | java-exception | android.os.RemoteException: Unknown binder error code. 0xfffffff7at android.os.BinderProxy.transact(Native Method)
12:16:23     INFO -  WARNING | leakcheck | refcount logging is off, so leaks can't be detected!
12:16:23     INFO -  Mochitest INFO | runtests.py | Running tests: end.
12:16:26     INFO -  09-03 12:16:09.085 D/GeckoAppShell( 2182): Killing via System.exit()
12:16:26     INFO -  09-03 12:16:09.093 E/JavaBinder( 2182): Unknown binder error code. 0xfffffff7
12:16:26     INFO -  09-03 12:16:09.109 V/TabletStatusBar( 1481): setLightsOn(true)
12:16:26     INFO -  09-03 12:16:09.117 V/GeckoHealthRec( 2182): Recorded session entry for env 1, current is 1
12:16:26     INFO -  09-03 12:16:09.117 D/GeckoSessInfo( 2182): Recording session done: 1378210211652
12:16:26     INFO -  09-03 12:16:09.117 F/libc    ( 2182): Fatal signal 11 (SIGSEGV) at 0x62afb8c2 (code=1)
12:16:26     INFO -  09-03 12:16:09.187 E/JavaBinder( 2182): Unknown binder error code. 0xfffffff7
12:16:26     INFO -  09-03 12:16:09.187 W/dalvikvm( 2182): threadid=11: thread exiting with uncaught exception (group=0x40b741f8)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 87 ("GeckoBackgroundThread")
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): android.os.RemoteException: Unknown binder error code. 0xfffffff7
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at android.os.BinderProxy.transact(Native Method)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at android.accounts.IAccountManager$Stub$Proxy.getAccounts(IAccountManager.java:474)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at android.accounts.AccountManager.getAccountsByType(AccountManager.java:400)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at org.mozilla.gecko.sync.setup.SyncAccounts.syncAccountsExist(SyncAccounts.java:68)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at org.mozilla.gecko.Tabs$1.run(Tabs.java:76)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at android.os.Handler.handleCallback(Handler.java:605)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at android.os.Handler.dispatchMessage(Handler.java:92)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at android.os.Looper.loop(Looper.java:137)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): 	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:32)
12:16:26     INFO -  09-03 12:16:09.187 E/GeckoAppShell( 2182): Main thread stack:
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): android.os.MessageQueue.nativePollOnce(Native Method)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): android.os.MessageQueue.next(MessageQueue.java:118)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): android.os.Looper.loop(Looper.java:118)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): android.app.ActivityThread.main(ActivityThread.java:4424)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): java.lang.reflect.Method.invokeNative(Native Method)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): java.lang.reflect.Method.invoke(Method.java:511)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:784)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): com.android.internal.os.ZygoteInit.main(ZygoteInit.java:551)
12:16:26     INFO -  09-03 12:16:09.203 E/GeckoAppShell( 2182): dalvik.system.NativeStart.main(Native Method)
12:16:26     INFO -  09-03 12:16:09.210 F/libc    ( 2182): Fatal signal 11 (SIGSEGV) at 0x62bed878 (code=1)
12:16:26     INFO -  09-03 12:16:09.312 W/NetworkManagementSocketTagger( 1402): setKernelCountSet(10046, 0) failed with errno -2
12:16:26     INFO -  09-03 12:16:09.312 I/ActivityManager( 1402): Process org.mozilla.fennec (pid 2182) has died.
12:16:26     INFO -  09-03 12:16:09.312 I/WindowManager( 1402): WIN DEATH: Window{41410dd0 org.mozilla.fennec/org.mozilla.fennec.App paused=false}
12:16:26     INFO -  09-03 12:16:09.312 I/WindowManager( 1402): WIN DEATH: Window{41479b60 SurfaceView paused=false}
12:16:26     INFO -  09-03 12:16:09.320 I/DEBUG   ( 1290): debuggerd committing suicide to free the zombie!
12:16:26     INFO -  09-03 12:16:09.328 I/DEBUG   ( 2632): debuggerd: May 23 2012 14:36:54
The call that's dying is:

    final boolean accountsExist = AccountManager.get(c).getAccountsByType(SyncConstants.ACCOUNTTYPE_SYNC).length > 0;

Probably what's happening here is something like

  http://stackoverflow.com/questions/6729718/activity-restarted-when-application-is-exited-from-unexception-handler

-- that is, we're calling System.exit (which we should not do), but the activity is still starting up, and the background thread is still working. The context has been invalidated, so everything is broken.

Solution: don't call System.exit. Really. If you do call System.exit, you must ignore all exceptions that occur afterwards, because there's nothing we can do to stop Android from screaming.
Mark, can you help us find someone to look into this please?
Flags: needinfo?(mark.finkle)
Sounds like we have a diagnostic in comment 2. I assume this is happening during Shutdonw, by the look og the output. I also assume that the shutdown is from someone terminating Fennec. I don't think Fennec shutsdown on it's own from the tests.

If the logic in comment 2 is to be followed, when we terminate/shutdown Fennec, we need to ignore any exceptions afterward.

Make sense?
Flags: needinfo?(mark.finkle)
A different approach would be to add a general "silencer" try/catch to this block of code:
http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/Tabs.java#88

It will already quietly eat a security exception.
(In reply to Mark Finkle (:mfinkle) from comment #86)
> If the logic in comment 2 is to be followed, when we terminate/shutdown
> Fennec, we need to ignore any exceptions afterward.
> 
> Make sense?

We report "PROCESS-CRASH | java-exception | ..." by parsing the logcat for "REPORTING UNCAUGHT EXCEPTION". I suspect we could avoid this by stopping that logcat parsing when we hit something like "Killing via System.exit()", but that seems like a fragile, slightly magical, solution.
(In reply to Richard Newman [:rnewman] from comment #2)
> Solution: don't call System.exit. Really. If you do call System.exit, you
> must ignore all exceptions that occur afterwards, because there's nothing we
> can do to stop Android from screaming.

I am pretty sure that the System.exit() call is at http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/GeckoAppShell.java#714:

702     static void onXreExit() {
      ...
713         Log.d(LOGTAG, "Killing via System.exit()");
714         System.exit(0);
715     }

Can we avoid that?
This intermittent goes perma-orange on landing bug 958596 so fixing it sooner rather than later is good. Possible patch coming up.
Attached patch Patch (obsolete) — Splinter Review
Pushed to try on top of the inbound csets for bug 958596 here:

https://tbpl.mozilla.org/?tree=Try&rev=a46f60fc770d
Attachment #8382189 - Flags: review?(mark.finkle)
Attached patch Patch v2Splinter Review
Oops, I missed a couple of calls to System.exit. Might as well catch those too. New try push:

https://tbpl.mozilla.org/?tree=Try&rev=6d669432c611
Attachment #8382189 - Attachment is obsolete: true
Attachment #8382189 - Flags: review?(mark.finkle)
Attachment #8382197 - Flags: review?(mark.finkle)
Comment on attachment 8382197 [details] [diff] [review]
Patch v2

Review of attachment 8382197 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/GeckoApp.java
@@ +1841,5 @@
>  
>      @Override
>      protected void onNewIntent(Intent intent) {
>          if (GeckoThread.checkLaunchState(GeckoThread.LaunchState.GeckoExiting)) {
>              // We're exiting and shouldn't try to do anything else just incase

nit: can you fix this comment?

::: mobile/android/base/GeckoAppShell.java
@@ +385,5 @@
>      @WrapElementForJNI(allowMultithread = true, generateStatic = true, noThrow = true)
>      public static void handleUncaughtException(Thread thread, Throwable e) {
> +        if (GeckoThread.checkLaunchState(GeckoThread.LaunchState.GeckoExited)) {
> +            // We've called System.exit. All exceptions after this point are Android
> +            // berating us for being nasty to it

nit: trailing .

@@ +704,5 @@
>                  getGeckoInterface().getActivity().finish();
>              }
>          }
>  
>          Log.d(LOGTAG, "Killing via System.exit()");

This is true, but not as helpful as identifying |systemExit| (which is likely to grow functionality).
Attachment #8382197 - Flags: feedback+
Looks like the talos tests are still barfing all over the place. Anybody have any ideas?
Joel -- can you point out how/where the non-robocop Talos tests quit the browser?
Flags: needinfo?(jmaher)
this is the code we use to quit:
http://hg.mozilla.org/build/talos/file/7674ffd4b494/talos/page_load_test/quit.js

for ts tests, we include this script and call quit after we report the numbers.

for pageloader, we include quit in the extension and call it when we have loaded all the pages.
Flags: needinfo?(jmaher)
To me it doesn't even look like the talos tests are running - they hang before they even run, no?
I think they are running.

Consider https://tbpl.mozilla.org/php/getParsedLog.php?id=35283934&tree=Try&full=1

07:15:20     INFO -  INFO : Reading 'browser_version' from remoteapp.ini => 30.0a1
07:15:20     INFO -  INFO : Reading 'browser_name' from remoteapp.ini => Fennec
07:15:20     INFO -  INFO : Reading 'repository' from remoteapp.ini => https://hg.mozilla.org/try
07:15:20     INFO -  INFO : Reading 'sourcestamp' from remoteapp.ini => 6d669432c611
07:15:20     INFO -  panda-0030:
07:15:20     INFO -  		Started Wed, 26 Feb 2014 07:15:20
07:15:20     INFO -  Running test tsvgx:
07:15:20     INFO -  		Started Wed, 26 Feb 2014 07:15:20

command timed out: 3600 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=4421.143499


Unfortunately, mozharness started the test with: "python /builds/panda-0030/talos-data/talos/run_tests.py ..." Since python -u was not used, the python process will buffer its output. 

I suspect what happens is run_tests.py runs the test, waits indefinitely for the browser to end, and then mozharness kills run_tests.py after 3600 seconds, losing the buffered output from run_tests.py.

There's some speculation there...it could be happening differently. But I've seen this pattern before, and it broadly explains why a change to quit behavior would affect the test.
Hm, interesting. Can we run it without the -u option to get unbuffered output? Even if it's just for try purposes?
I don't think there's a way to change it on try, but I'll submit a mozharness patch in bug 977298 -- it may take a few days to get to production.
I also did a try push of just this patch alone on top of the latest m-c code, without botond's logging patches underneath. That try push is at https://tbpl.mozilla.org/?tree=Try&rev=689182718a5c and is looking pretty clean. That makes me suspect that botond's patches were responsible for the talos errors, and this patch can probably land to fix the RemoteException problem. Will wait until it's fully green with some retriggers just to be safe.
Attachment #8382197 - Flags: review?(mark.finkle) → review+
Try push is clean so far (although there's still quite a backlog on the android 2.2 retriggers), so landing with Nick's comments addressed:

https://hg.mozilla.org/integration/fx-team/rev/3b92bea7b2e7
Assignee: nobody → bugmail.mozilla
https://hg.mozilla.org/mozilla-central/rev/3b92bea7b2e7
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 30
Please request Aurora/Beta approval for this when you're reasonably confident that it's baked long enough.
Discussed with RyanVM on IRC; this doesn't seem to be occuring on 28 or 29 anymore, so marking those unaffected. If we do get reports of it occurring there then we can consider uplifting the patch.
You need to log in before you can comment on or make changes to this bug.