Closed Bug 883539 Opened 11 years ago Closed 9 years ago

Frequent Android 4.0 rc2 retries: "Remote Device Error: unable to connect to %s after %s attempts" % (self.host, retryLimit)

Categories

(Firefox for Android Graveyard :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gbrown, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [qa-automation-blocked])

Attachments

(1 file)

Beginning June 12 on m-i, there are frequent retries of rc2 on pandas with "Remote Device Error: unable to connect to %s after %s attempts" % (self.host, retryLimit)

Logcats show "*** WATCHDOG KILLING SYSTEM PROCESS:" during various rc2 tests.

We have seen this condition before, but it mysteriously stopped happening some weeks ago.

Likely regressed on m-i by https://hg.mozilla.org/integration/mozilla-inbound/rev/0c890f39dfa6, bugs 828654, 840127, 882319 (FHR).

Compare https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android&rev=4ae3e1cede0a (5 of these failures in the first 10 runs) with the previous job, https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android&rev=a4e3ffaf7743.  a4e3ffaf7743 has a couple of retries, but they have a different cause and happen before any tests are run.
What does this error actually mean? That Fennec isn't running? What is the harness trying to do that causes this error? 

Those patches add an alarm driven service, but that's routine Android stuff.
(In reply to Geoff Brown [:gbrown] from comment #1)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=24192440&tree=Mozilla-
> Inbound&full=1
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=24192441&tree=Mozilla-
> Inbound&full=1
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=24192538&tree=Mozilla-
> Inbound&full=1

So these three logs all fail immediately after testDistribution.  Is this meaningful?
(In reply to Geoff Brown [:gbrown] from comment #1)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=24192440&tree=Mozilla-
> Inbound&full=1

This log looks suspicious:

06-15 06:09:36.625 D/GeckoSessInfo( 4394): Building SessionInformation from prefs: 0, 0, true, false
06-15 06:09:36.640 I/GeckoApp( 4394): Creating BrowserHealthRecorder.
06-15 06:09:36.640 D/GeckoHealthRec( 4394): Initializing. Dispatcher is org.mozilla.gecko.util.EventDispatcher@41222f10
06-15 06:09:36.640 D/GeckoHealthRec( 4394): Initializing profile cache.
06-15 06:09:36.640 D/GeckoHealthRec( 4394): Looking for /mnt/sdcard/tests/profile/times.json
06-15 06:09:36.656 I/dalvikvm( 4394): Could not find method org.mozilla.gecko.GeckoTextView.setImportantForAccessibility, referenced from method org.mozilla.gecko.BrowserToolbar.from
06-15 06:09:36.656 W/dalvikvm( 4394): VFY: unable to resolve virtual method 11780: Lorg/mozilla/gecko/GeckoTextView;.setImportantForAccessibility (I)V
06-15 06:09:36.656 I/dalvikvm( 4394): Could not find method android.widget.ImageButton.setImportantForAccessibility, referenced from method org.mozilla.gecko.BrowserToolbar.from
06-15 06:09:36.656 W/dalvikvm( 4394): VFY: unable to resolve virtual method 4534: Landroid/widget/ImageButton;.setImportantForAccessibility (I)V
06-15 06:09:36.656 I/dalvikvm( 4394): Could not find method android.widget.ImageView.setImportantForAccessibility, referenced from method org.mozilla.gecko.BrowserToolbar.from
06-15 06:09:36.656 W/dalvikvm( 4394): VFY: unable to resolve virtual method 4557: Landroid/widget/ImageView;.setImportantForAccessibility (I)V
06-15 06:09:36.656 D/GeckoHealthRec( 4394): Incorporating environment: profile creation = 1371275003387
06-15 06:09:36.656 D/GeckoHealthRec( 4394): Done initializing profile cache. Beginning storage init.
06-15 06:09:36.765 I/Process ( 1403): Sending signal. PID: 4394 SIG: 3
06-15 06:09:36.765 I/dalvikvm( 4394): threadid=3: reacting to signal 3
06-15 06:09:36.781 I/dalvikvm( 4394): Wrote stack traces to '/data/anr/traces.txt'

This all happens before Gecko starts up.  Is it possible we're seeing a bad interaction here?  Or that the testing profile data is old and causing problems?
> > https://tbpl.mozilla.org/php/getParsedLog.php?id=24192441&tree=Mozilla-
> > Inbound&full=1

We should probably address this:

06-15 06:09:26.375 D/GeckoHealthRec( 4363): Recording session end: P
06-15 06:09:26.375 E/GeckoHealthRec( 4363): Attempted to record session end without initialized recorder.
Further investigation suggests that testFindInPage might be the culprit.  Search for '3153936' -- these are screen grabs being taken in build/mobile/robocop/PaintedSurface.  testFindInPage isn't close()ing the grabbed surface.  If you look at the logs post testFindInPage, you'll see the same thing:

06-15 06:19:39.250 I/Robocop ( 4371): 5 INFO TEST-PASS | testFindInPage | Checking if the next button was clicked - button was clicked
06-15 06:19:39.351 E/libEGL  ( 4371): eglMakeCurrent:696 error 300d (EGL_BAD_SURFACE)
06-15 06:19:39.390 D/GeckoLayerClient( 4371): Window-size changed to (1280,616)
06-15 06:19:39.406 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:39.429 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:42.335 V/tiny_hw ( 1294): out_standby(0x1892160) closing PCM
06-15 06:19:42.554 I/dalvikvm-heap( 4371): Grow heap (frag case) to 11.360MB for 3153936-byte allocation
06-15 06:19:43.710 I/Robocop ( 4371): PaintExpecter: no longer listening for events
06-15 06:19:43.710 I/Robocop ( 4371): 6 INFO TEST-PASS | testFindInPage | Pixel at 640,77 - Color rgba(255,0,0,255)  close enough to expected rgb(255,0,0)
06-15 06:19:43.804 D/InputEventConsistencyVerifier( 4371): KeyEvent: ACTION_UP but key was not down.
06-15 06:19:43.804 D/InputEventConsistencyVerifier( 4371):   in Handler (android.view.ViewRootImpl) {41293ce0}
06-15 06:19:43.804 D/InputEventConsistencyVerifier( 4371):   0: sent at 1794014000000, KeyEvent { action=ACTION_UP, keyCode=KEYCODE_MENU, scanCode=0, metaState=0, flags=0x8, repeatCount=0, eventTime=1794014, downTime=1794014, deviceId=-1, source=0x101 }
06-15 06:19:46.085 I/Robocop ( 4371): 7 INFO TEST-PASS | testFindInPage | Looking for the next search match button in the Find in Page UI - Found the next match button
06-15 06:19:46.117 V/tiny_hw ( 1294): out_write(0x1892160) opening PCM
06-15 06:19:46.265 E/libEGL  ( 4371): eglMakeCurrent:696 error 300d (EGL_BAD_SURFACE)
06-15 06:19:46.320 D/GeckoLayerClient( 4371): Window-size changed to (1280,568)
06-15 06:19:46.343 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:46.398 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:46.625 E/libEGL  ( 4371): eglMakeCurrent:696 error 300d (EGL_BAD_SURFACE)
06-15 06:19:46.687 D/GeckoLayerClient( 4371): Window-size changed to (1280,188)
06-15 06:19:46.687 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:46.781 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:46.843 D/GeckoEventDispatcher( 4371): dispatchEvent: no listeners registered for event 'robocop:scroll'
06-15 06:19:46.914 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.015 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.070 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.125 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.179 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.250 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.320 D/GeckoEventDispatcher( 4371): dispatchEvent: no listeners registered for event 'robocop:scroll'
06-15 06:19:47.390 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.492 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.609 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.640 E/libEGL  ( 4371): eglMakeCurrent:696 error 300d (EGL_BAD_SURFACE)
06-15 06:19:47.648 D/GeckoLayerClient( 4371): Window-size changed to (1280,568)
06-15 06:19:47.664 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.695 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.757 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.804 I/Robocop ( 4371): 8 INFO TEST-PASS | testFindInPage | Checking if the next button was clicked - button was clicked
06-15 06:19:47.812 D/GeckoEventDispatcher( 4371): dispatchEvent: no listeners registered for event 'robocop:scroll'
06-15 06:19:47.835 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:47.859 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:48.320 I/Robocop ( 4371): 9 INFO TEST-PASS | testFindInPage | Checking if the next button was clicked - button was clicked
06-15 06:19:48.351 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:48.468 E/libEGL  ( 4371): eglMakeCurrent:696 error 300d (EGL_BAD_SURFACE)
06-15 06:19:48.476 D/GeckoLayerClient( 4371): Window-size changed to (1280,616)
06-15 06:19:48.484 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:48.531 D/Robocop ( 4371): Received drawFinished notification
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): problem reading network stats
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): java.lang.IllegalStateException: problem parsing idx 1
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.net.NetworkStatsFactory.readNetworkStatsDetail(NetworkStatsFactory.java:300)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.net.NetworkStatsFactory.readNetworkStatsDetail(NetworkStatsFactory.java:250)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl.getNetworkStatsDetailGroupedByUid(BatteryStatsImpl.java:5734)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl.access$100(BatteryStatsImpl.java:76)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl$Uid.computeCurrentTcpBytesReceived(BatteryStatsImpl.java:2457)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl$Uid.getTcpBytesReceived(BatteryStatsImpl.java:2446)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl.writeSummaryToParcel(BatteryStatsImpl.java:5437)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl.writeLocked(BatteryStatsImpl.java:4836)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.os.BatteryStatsImpl.writeAsyncLocked(BatteryStatsImpl.java:4818)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.server.am.ActivityManagerService.updateCpuStatsNow(ActivityManagerService.java:1652)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.server.am.ActivityManagerService$3.run(ActivityManagerService.java:1534)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): Caused by: java.io.FileNotFoundException: /proc/net/xt_qtaguid/stats: open failed: ENOENT (No such file or directory)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at libcore.io.IoBridge.open(IoBridge.java:406)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at java.io.FileInputStream.<init>(FileInputStream.java:78)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at com.android.internal.net.NetworkStatsFactory.readNetworkStatsDetail(NetworkStatsFactory.java:269)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	... 10 more
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): Caused by: libcore.io.ErrnoException: open failed: ENOENT (No such file or directory)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at libcore.io.Posix.open(Native Method)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at libcore.io.BlockGuardOs.open(BlockGuardOs.java:110)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	at libcore.io.IoBridge.open(IoBridge.java:390)
06-15 06:19:49.757 F/BatteryStatsImpl( 1401): 	... 12 more
06-15 06:19:51.406 V/tiny_hw ( 1294): out_standby(0x1892160) closing PCM
06-15 06:19:51.632 I/dalvikvm-heap( 4371): Grow heap (frag case) to 11.359MB for 3153936-byte allocation
06-15 06:19:51.710 E/StrictMode( 4371): A resource was acquired at attached stack trace but never released. See java.io.Closeable for information on avoiding resource leaks.
06-15 06:19:51.710 E/StrictMode( 4371): java.lang.Throwable: Explicit termination method 'close' not called
06-15 06:19:51.710 E/StrictMode( 4371): 	at dalvik.system.CloseGuard.open(CloseGuard.java:184)
06-15 06:19:51.710 E/StrictMode( 4371): 	at java.io.FileInputStream.<init>(FileInputStream.java:80)
06-15 06:19:51.710 E/StrictMode( 4371): 	at java.io.FileInputStream.<init>(FileInputStream.java:105)
06-15 06:19:51.710 E/StrictMode( 4371): 	at org.mozilla.fennec.PaintedSurface.<init>(PaintedSurface.java:30)
06-15 06:19:51.710 E/StrictMode( 4371): 	at org.mozilla.fennec.FennecNativeDriver.getPaintedSurface(FennecNativeDriver.java:311)
06-15 06:19:51.710 E/StrictMode( 4371): 	at org.mozilla.fennec.tests.PixelTest.waitForPaint(PixelTest.java:51)
06-15 06:19:51.710 E/StrictMode( 4371): 	at org.mozilla.fennec.tests.testFindInPage.testFindInPage(testFindInPage.java:28)
06-15 06:19:51.710 E/StrictMode( 4371): 	at java.lang.reflect.Method.invokeNative(Native Method)
06-15 06:19:51.710 E/StrictMode( 4371): 	at java.lang.reflect.Method.invoke(Method.java:511)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.test.InstrumentationTestCase.runMethod(InstrumentationTestCase.java:214)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.test.InstrumentationTestCase.runTest(InstrumentationTestCase.java:199)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.test.ActivityInstrumentationTestCase2.runTest(ActivityInstrumentationTestCase2.java:192)
06-15 06:19:51.710 E/StrictMode( 4371): 	at org.mozilla.fennec.tests.BaseTest.runTest(BaseTest.java:131)
06-15 06:19:51.710 E/StrictMode( 4371): 	at junit.framework.TestCase.runBare(TestCase.java:127)
06-15 06:19:51.710 E/StrictMode( 4371): 	at junit.framework.TestResult$1.protect(TestResult.java:106)
06-15 06:19:51.710 E/StrictMode( 4371): 	at junit.framework.TestResult.runProtected(TestResult.java:124)
06-15 06:19:51.710 E/StrictMode( 4371): 	at junit.framework.TestResult.run(TestResult.java:109)
06-15 06:19:51.710 E/StrictMode( 4371): 	at junit.framework.TestCase.run(TestCase.java:118)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:169)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.test.AndroidTestRunner.runTest(AndroidTestRunner.java:154)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.test.InstrumentationTestRunner.onStart(InstrumentationTestRunner.java:545)
06-15 06:19:51.710 E/StrictMode( 4371): 	at android.app.Instrumentation$InstrumentationThread.run(Instrumentation.java:1551)
06-15 06:19:52.640 I/Robocop ( 4371): PaintExpecter: no longer listening for events
06-15 06:19:52.640 I/Robocop ( 4371): 10 INFO TEST-PASS | testFindInPage | Pixel at 640,77 - Color rgba(255,255,255,255) is different enough from rgb(255,0,0)
06-15 06:19:52.640 I/Robocop ( 4371): 11 INFO TEST-END | testFindInPage | finished in 33836ms
06-15 06:19:52.640 I/Robocop ( 4371): 12 INFO TEST-START | Shutdown
06-15 06:19:52.640 I/Robocop ( 4371): 13 INFO Passed: 9
06-15 06:19:52.640 I/Robocop ( 4371): 14 INFO Failed: 0
06-15 06:19:52.640 I/Robocop ( 4371): 15 INFO Todo: 0
06-15 06:19:52.640 I/Robocop ( 4371): 16 INFO SimpleTest FINISHED
06-15 06:20:00.398 E/Profiler( 4371): BPUnw: [6 total] thread_unregister_for_profiling(me=0x2215970) 
06-15 06:20:00.398 E/Profiler( 4371): BPUnw: [5 total] thread_unregister_for_profiling(me=0x2175db8) 
06-15 06:20:41.789 I/Process ( 1401): Sending signal. PID: 1401 SIG: 3
06-15 06:20:41.789 I/dalvikvm( 1401): threadid=3: reacting to signal 3
06-15 06:20:41.828 I/dalvikvm( 1401): Wrote stack traces to '/data/anr/traces.txt'
06-15 06:20:51.250 E/GeckoConsole( 4371): [JavaScript Warning: "WARN addons.updates: Request failed: http://10.12.135.22:30763/extensions-dummy/updateBackgroundURL - 404: Not Found" {file: "resource://gre/modules/AddonUpdateChecker.jsm" line: 469}]
06-15 06:20:51.257 E/GeckoConsole( 4371): [JavaScript Warning: "WARN addons.updates: Request failed: http://10.12.135.22:30763/extensions-dummy/updateBackgroundURL - 404: Not Found" {file: "resource://gre/modules/AddonUpdateChecker.jsm" line: 469}]
06-15 06:21:11.843 I/Process ( 1401): Sending signal. PID: 1401 SIG: 3
06-15 06:21:11.843 I/dalvikvm( 1401): threadid=3: reacting to signal 3
06-15 06:21:11.859 I/dalvikvm( 1401): Wrote stack traces to '/data/anr/traces.txt'
06-15 06:21:11.859 I/Process ( 1401): Sending signal. PID: 1564 SIG: 3
06-15 06:21:11.859 I/dalvikvm( 1564): threadid=3: reacting to signal 3
06-15 06:21:11.882 I/dalvikvm( 1564): Wrote stack traces to '/data/anr/traces.txt'
06-15 06:21:13.914 I/Watchdog_N( 1401): dumpKernelStacks
06-15 06:21:14.054 I/dalvikvm-heap( 1401): Grow heap (frag case) to 9.327MB for 198144-byte allocation
06-15 06:21:14.187 I/dalvikvm-heap( 1401): Grow heap (frag case) to 9.291MB for 148896-byte allocation
06-15 06:21:15.960 I/Process ( 1401): Sending signal. PID: 1401 SIG: 9
06-15 06:21:15.960 W/Watchdog( 1401): *** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService

We might be resource starving Android's system processes, since you'll see that they SIG 3 (dump stacks) and then SIG 9 (kill) themselves.
(In reply to Nick Alexander :nalexander from comment #5)
> > > https://tbpl.mozilla.org/php/getParsedLog.php?id=24192441&tree=Mozilla-
> > > Inbound&full=1
> 
> We should probably address this:
> 
> 06-15 06:09:26.375 D/GeckoHealthRec( 4363): Recording session end: P
> 06-15 06:09:26.375 E/GeckoHealthRec( 4363): Attempted to record session end
> without initialized recorder.

Is this just due to the activity ending immediately after it began? If so, that's perfectly safe, and the intended behavior.
Hardware: x86 → ARM
See Also: → 883689
(In reply to Richard Newman [:rnewman] from comment #2)
> What does this error actually mean? That Fennec isn't running? What is the
> harness trying to do that causes this error? 

The "unable to connect to " message originates in devicemanagerSUT -- python code running on the test slave/foopy that communicates with the device (sutAgent), using sockets. If devicemanagerSUT gets a socket error when sending to the device, it tries to re-connect...and if that fails, it tries again...up to 5 times, with increasing pauses in between attempts. Still no connection? Then the test fails and this message is printed.

I suspect that either sutAgent is being killed, or some network service on the device is being killed, based on the logcat messages:

06-15 06:21:15.960 W/Watchdog( 1401): *** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService
...
06-15 06:21:16.031 I/ServiceManager( 1286): service 'ethernet' died
...
06-15 13:22:26.263 I/SUTAgentWatcher( 4932): Starting SUTAgent from watcher code


The WATCHDOG KILLING SYSTEM PROCESS message seems to be the start of all the trouble. I'm not sure what that really means. Out of memory? Fatal error in some system service? Some sort of ANR condition?

I will do some research today to try to determine the conditions that can produce WATCHDOG KILLING SYSTEM PROCESS.

Would someone volunteer to back out the suspected-regressing changesets, perhaps just in a try push, to see if that will resolve the test failures?

Also, I believe there are ANR dumps in some of the logs -- we could use some analysis of those.
(In reply to Nick Alexander :nalexander from comment #6)
> Further investigation suggests that testFindInPage might be the culprit. 

The last time we saw similar failures, we managed to correlate the failures with certain tests, and found that we could run the remaining tests reliably. See bug 862493. Note that it was a group of 7 tests, with very little in common. Some time later we were able to enable some of those tests, and later still, more, until we could no longer reproduce the failures at all. 

I hope we can achieve greater understanding and find a permanent solution this time.
(In reply to Geoff Brown [:gbrown] from comment #8)

> The WATCHDOG KILLING SYSTEM PROCESS message seems to be the start of all the
> trouble. I'm not sure what that really means. Out of memory? Fatal error in
> some system service? Some sort of ANR condition?

It means some condition has been met -- e.g., deadlock -- that's detectable by the Android watchdog. It kills the processes to break the condition.

This message implies that something in SUTAgent is (either directly or indirectly) causing system contention that makes the watchdog unhappy.

> Would someone volunteer to back out the suspected-regressing changesets,
> perhaps just in a try push, to see if that will resolve the test failures?

I'll do it for a try push, but I'm not sure what we'll gain -- if our test harness is falling over due to sheer weight of code, backing out the code isn't a fix.

I'll try doing some log reading today, see if we can more precisely point a finger.
The very last thing the Fennec process does -- after the activity has stopped, after FHR has recorded session end -- is:

06-15 06:10:55.570 E/GeckoConsole( 4363): [JavaScript Warning: "WARN addons.updates: Request failed: http://10.12.134.19:30636/extensions-dummy/updateBackgroundURL - 404: Not Found" {file: "resource://gre/modules/AddonUpdateChecker.jsm" line: 469}]

Soon after, you can see the two processes that Watchdog is killing:

06-15 06:11:20.218 I/Process ( 1403): Sending signal. PID: 1403 SIG: 3
06-15 06:11:20.250 I/Process ( 1403): Sending signal. PID: 1586 SIG: 3


1403 is the sytem process.
1586 is com.android.phone.


In response to quit, both processes start to write ANRs. They take too long, so the watchdog kill-9s them.

It seems too much of a coincidence that the battery service is throwing inside a locked section. If that's resulting in a different process trying to take that lock to do error reporting or recovery, it'll deadlock.

I don't see anything in these logs that suggest FHR is at fault.
Third log actually kills the system process before we fetch add-ons:

06-15 06:20:41.789 I/Process ( 1401): Sending signal. PID: 1401 SIG: 3
06-15 06:20:41.789 I/dalvikvm( 1401): threadid=3: reacting to signal 3
06-15 06:20:41.828 I/dalvikvm( 1401): Wrote stack traces to '/data/anr/traces.txt'
06-15 06:20:51.250 E/GeckoConsole( 4371): [JavaScript Warning: "WARN addons.updates: Request failed: http://10.12.135.22:30763/extensions-dummy/updateBackgroundURL - 404: Not Found" {file: "resource://gre/modules/AddonUpdateChecker.jsm" line: 469}]
Attached file Suspicious logs
I would definitely like to see DropBox ANR reporting disabled.
Binder Thread #8 (tid 64) holds ActivityManagerService, and wants BatteryStats.

ProcessStats (tid 13) holds BatteryStats and wants ActivityManagerService. It's in the middle of uploading a WTF log to DropBox, and to do so it wants to send a broadcast intent, which requires locking.


The fix for this is one of:

* Don't upload WTF logs to DropBox mid-error. This holds locks for too long, and apparently enough things on the pandas that throw WTFs.
* Fix whatever is causing this WTF (I would guess the network stats issue that's barfing on startup, and is called by BatteryStatsImpl).
I should clarify: this is android.os.DropBoxManager, not the DropBox cloud storage provider. It's system-provided log storage.
To disable DropBoxManager logging for WTFs:

        ContentResolver cr = getContext().getContentResolver();
        Settings.Secure.putString(cr, Settings.Secure.DROPBOX_TAG_PREFIX + "system_server_wtf",
                                  "disabled");
        Settings.Secure.putString(cr, Settings.Secure.DROPBOX_TAG_PREFIX + "system_app_wtf",
                                  "disabled");
        Settings.Secure.putString(cr, Settings.Secure.DROPBOX_TAG_PREFIX + "data_app_wtf",
                                  "disabled");
Depends on: 883966
I tried adding the code in Comment 17 to sutAgent, which has WRITE_SECURE_SETTINGS in its manifest, but it fails anyway:

E/AndroidRuntime( 2126): FATAL EXCEPTION: main
E/AndroidRuntime( 2126): java.lang.RuntimeException: Unable to start activity ComponentInfo{com.mozilla.SUTAgentAndroid/com.mozilla.SUTAgentAndroid.SUTAgentAndroid}: java.lang.SecurityException: Permission denial: writing to secure settings requires android.permission.WRITE_SECURE_SETTINGS
E/AndroidRuntime( 2126): 	at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:1956)
E/AndroidRuntime( 2126): 	at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:1981)
E/AndroidRuntime( 2126): 	at android.app.ActivityThread.access$600(ActivityThread.java:123)
E/AndroidRuntime( 2126): 	at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1147)
E/AndroidRuntime( 2126): 	at android.os.Handler.dispatchMessage(Handler.java:99)
E/AndroidRuntime( 2126): 	at android.os.Looper.loop(Looper.java:137)
E/AndroidRuntime( 2126): 	at android.app.ActivityThread.main(ActivityThread.java:4424)
E/AndroidRuntime( 2126): 	at java.lang.reflect.Method.invokeNative(Native Method)
E/AndroidRuntime( 2126): 	at java.lang.reflect.Method.invoke(Method.java:511)
E/AndroidRuntime( 2126): 	at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:784)
E/AndroidRuntime( 2126): 	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:551)
E/AndroidRuntime( 2126): 	at dalvik.system.NativeStart.main(Native Method)
E/AndroidRuntime( 2126): Caused by: java.lang.SecurityException: Permission denial: writing to secure settings requires android.permission.WRITE_SECURE_SETTINGS
E/AndroidRuntime( 2126): 	at android.os.Parcel.readException(Parcel.java:1327)
E/AndroidRuntime( 2126): 	at android.database.DatabaseUtils.readExceptionFromParcel(DatabaseUtils.java:182)
E/AndroidRuntime( 2126): 	at android.database.DatabaseUtils.readExceptionFromParcel(DatabaseUtils.java:136)
E/AndroidRuntime( 2126): 	at android.content.ContentProviderProxy.insert(ContentProviderNative.java:415)
E/AndroidRuntime( 2126): 	at android.content.ContentResolver.insert(ContentResolver.java:730)
E/AndroidRuntime( 2126): 	at android.provider.Settings$NameValueTable.putString(Settings.java:639)
E/AndroidRuntime( 2126): 	at android.provider.Settings$Secure.putString(Settings.java:2223)
E/AndroidRuntime( 2126): 	at com.mozilla.SUTAgentAndroid.SUTAgentAndroid.disableDropbox(SUTAgentAndroid.java:159)
E/AndroidRuntime( 2126): 	at com.mozilla.SUTAgentAndroid.SUTAgentAndroid.onCreate(SUTAgentAndroid.java:174)
E/AndroidRuntime( 2126): 	at android.app.Activity.performCreate(Activity.java:4465)
E/AndroidRuntime( 2126): 	at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1049)
E/AndroidRuntime( 2126): 	at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:1920)
E/AndroidRuntime( 2126): 	... 11 more

See http://stackoverflow.com/questions/13045283/write-secure-settings-permission-error-even-when-added-in-manifest.


I think we need another strategy.
Fixing network stats (Bug 883966) seems like the best option, then. *sigh*
You can also try editing the Android settings database manually - it's a sqlite database under /data that you can grab, edit, and then push back. (You would need root of course, and some kind of special blessing from releng if you want it across our infra)
Blocks: 878587
No longer blocks: 878587
Blocks: 807230
I am investigating the possibility of re-imaging the pandas, either:
 - a new AOSP build;
 - a new Linaro build, or;
 - patching the existing Linaro build.
Building/testing Android images is slow-going. This could take a while.
I reproduced this type of failure on try, with additional logging at:

https://tbpl.mozilla.org/?tree=Try&rev=7b0aff6d949f
https://tbpl.mozilla.org/?tree=Try&rev=6e8e0503111d

Curiously, I notice that most failures occur during the 30th or 33rd minute of uptime:

eg. https://tbpl.mozilla.org/php/getParsedLog.php?id=24835655&tree=Try&full=1#error0
07/02/2013 09:09:08: INFO:      raise DMError("Remote Device Error: unable to connect to %s after %s attempts" % (self.host, retryLimit))
07/02/2013 09:10:13: INFO:  Device info: {'uptime': ['0 days 0 hours 30 minutes 32 seconds 436 ms']
07/02/2013 09:10:13: INFO:  07-02 09:08:16.093 W/Watchdog( 1401): *** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService

eg. https://tbpl.mozilla.org/php/getParsedLog.php?id=24770273&tree=Try&full=1#error0
06/30/2013 20:11:04: INFO:      raise DMError("Remote Device Error: unable to connect to %s after %s attempts" % (self.host, retryLimit))
06/30/2013 20:12:12: INFO:  Device info: {'uptime': ['0 days 0 hours 33 minutes 24 seconds 710 ms'],
06/30/2013 20:12:11: INFO:  06-30 20:10:12.601 W/Watchdog( 1403): *** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService


If the watchdog action is related to system uptime, that would explain why these failures occur mostly during robocop jobs: those are the longest-running test jobs.

I will stay the course: try to come up with a more stable panda image.
I built my own AOSP 4.2.2 image, but ran into trouble flashing the panda. Then progress stalled when I moved my attention to other priorities (x86). I am still interested in replacing our panda image, but that will take time and I am not actively working on it at the moment.

Meanwhile, the rc2 retries continue. In fact they recently worsened when the Android 4.0 chunking was fixed. For some weeks we were running 4.0 robocop tests as chunk 1 of 3 and chunk 2 of 3, with 1/3 of tests not being run. Now that we are running all tests, we see more rc2 retries.

If we disable some long-running tests, rc2 runs cleanly: https://tbpl.mozilla.org/?tree=Try&rev=9ab3692af2cc. Incidentally, I picked the disabled tests there fairly randomly, only with an aim to reduce test time -- I suspect there is nothing "special" about those tests.

Another time-based experiment here: https://tbpl.mozilla.org/?tree=Try&rev=7ee5c01b563f, https://tbpl.mozilla.org/?tree=Try&rev=d0a7a9783935.

There appears to be a strong correlation between test time (system uptime?) and likelihood of retries. I have believed that this was a function of the watchdog and the deadlock identified by rnewman, but now there is a new idea....
:Callek thinks this is related to the mozpool timeout at http://mxr.mozilla.org/build/source/mozpool/mozpoolclient/mozpoolclient/mozpoolclient.py#182.

Also, jmaher notes that there is a buildbot timeout, which he thinks is 60 minutes.
Depends on: 895466
Whiteboard: [qa-automation-blocked]
It looks like increasing the mozpool timeout (bug 895466) did not fix this -- failures continue.

https://tbpl.mozilla.org/php/getParsedLog.php?id=25603382&tree=Mozilla-Inbound&full=1

We should consider running robocop as 3 or perhaps 4 chunks, since reducing the job time to less than 30 minutes appears to be an effective work-around.
(In reply to Geoff Brown [:gbrown] from comment #27)
> It looks like increasing the mozpool timeout (bug 895466) did not fix this
> -- failures continue.

Do you have evidence for this? when I increased the mozpool timeout I went based on what the timeout of the job was, a 30 minute "we get no output" timeout and a 4 hour "max runtime" timeout, I can't imagine a case where either of those are hit and we don't _already_ have a serious problem here.

From my skim of the bug it looks like we saw issues with failing to release/appropriate system resources properly which was/is starving the system (e.g. logs that show us losing ethernet, or killing off com.android.phone [which for the latter we saw break robocop horribly when we first turned on xpcshell since it wasn't cleaning up after itself properly]
(In reply to Geoff Brown [:gbrown] from comment #27)
> We should consider running robocop as 3 or perhaps 4 chunks, since reducing
> the job time to less than 30 minutes appears to be an effective work-around.

Err take my c#28 but with THIS part as the quote from c#27, its obvious to us all that the mozpool timeout didn't fix anything (with these tests)
So I am responding to:

> (In reply to Geoff Brown [:gbrown] from comment #27)
>> We should consider running robocop as 3 or perhaps 4 chunks, since reducing
>> the job time to less than 30 minutes appears to be an effective work-around.
>
> Do you have evidence for this?

Primarily comment 24 and comment 25: In try experiments, it seems that if we run fewer robocop tests, the watchdog does not kill processes and the tests pass. That's not conclusive, and says little about root cause, but suggests this work-around.
Bug 905274 split rc into 3 chunks on panda, reducing each chunk time to about 30 minutes. The retries reported here have nearly vanished. (There are still retries on all panda jobs as reported in bug 895176, and some other less frequent retries.)
Bug 926991 split rc into 4 chunks on panda, for similar reasons.

Now I am requesting bug 964363 split rc into 5 chunks, again because "Remote Device Error: unable to connect to %s after %s attempts" (caused by "WATCHDOG KILLING SYSTEM PROCESS") occurs when the robocop job run time gets too big. 

https://tbpl.mozilla.org/?tree=Try&rev=15113df8f2e9
https://tbpl.mozilla.org/php/getParsedLog.php?id=33637135&tree=Try&full=1#error0
Reviewing the recent logs, the pattern that I see is that WATCHDOG KILLING SYSTEM PROCESS is logged 30 minutes after the Panda reboot:

07:46:12     INFO -  01/25/2014 07:46:12: INFO: Calling PDU powercycle for panda-0173, panda-relay-015.p2.releng.scl1.mozilla.com:1:4
07:46:12     INFO -  01/25/2014 07:46:12: INFO: Waiting for device to come back...
07:47:42     INFO -  01/25/2014 07:47:42: INFO: Try 1
07:48:47     INFO -  01/25/2014 07:48:47: INFO: devroot None
07:50:18     INFO -  01/25/2014 07:50:18: INFO: Try 2
07:50:18     INFO -  01/25/2014 07:50:18: INFO: devroot /mnt/sdcard/tests
07:50:19     INFO -  01/25/2014 07:50:19: INFO: devroot /mnt/sdcard/tests
...
08:20:31     INFO -  01-25 08:19:06.617 W/Watchdog( 1402): *** WATCHDOG KILLING SYSTEM PROCESS: null


7:06:38     INFO -  01/25/2014 07:06:38: INFO: Calling PDU powercycle for panda-0168, panda-relay-014.p1.releng.scl1.mozilla.com:2:6
07:06:39     INFO -  01/25/2014 07:06:39: INFO: Waiting for device to come back...
07:08:09     INFO -  01/25/2014 07:08:09: INFO: Try 1
07:09:23     INFO -  01/25/2014 07:09:23: INFO: devroot None
07:10:53     INFO -  01/25/2014 07:10:53: INFO: Try 2
07:10:53     INFO -  01/25/2014 07:10:53: INFO: devroot /mnt/sdcard/tests
07:10:55     INFO -  01/25/2014 07:10:55: INFO: devroot /mnt/sdcard/tests
...
07:41:01     INFO -  01-25 07:39:36.453 W/Watchdog( 1402): *** WATCHDOG KILLING SYSTEM PROCESS: null


09:05:35     INFO -  01/27/2014 09:05:35: INFO: Calling PDU powercycle for panda-0151, panda-relay-013.p1.releng.scl1.mozilla.com:1:4
09:05:41     INFO -  01/27/2014 09:05:41: INFO: Waiting for device to come back...
09:07:11     INFO -  01/27/2014 09:07:11: INFO: Try 1
09:08:26     INFO -  01/27/2014 09:08:26: INFO: devroot None
09:09:56     INFO -  01/27/2014 09:09:56: INFO: Try 2
09:09:56     INFO -  01/27/2014 09:09:56: INFO: devroot /mnt/sdcard/tests
09:09:57     INFO -  01/27/2014 09:09:57: INFO: devroot /mnt/sdcard/tests
...
09:39:53     INFO -  01-27 09:38:29.734 W/Watchdog( 1402): *** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: