Closed Bug 675297 Opened 13 years ago Closed 10 years ago

Intermittent Android "talosError: 'timeout exceeded'"

Categories

(Testing :: Talos, defect)

ARM
Android
defect
Not set
normal

Tracking

(firefox16 affected, firefox17 affected, firefox18 affected, firefox19 affected)

RESOLVED WORKSFORME
Tracking Status
firefox16 --- affected
firefox17 --- affected
firefox18 --- affected
firefox19 --- affected

People

(Reporter: mbrubeck, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [see comment 729][mobile_unittests][android_tier_1])

I've seen this a lot in the past few days; example:
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Inbound/1311951029.1311953387.22271.gz

Android Tegra 250 mozilla-inbound talos remote-tpan on 2011/07/29 07:50:29 
TinderboxPrint: s: tegra-072

tegra-072: 
		Started Fri, 29 Jul 2011 07:52:51
Running test tpan: 
		Started Fri, 29 Jul 2011 07:52:51
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/getInfo.html'
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
reconnecting socket
pushing directory: /tmp/tmpaLR_d2/profile to /mnt/sdcard/tests/profile
	Screen width/height:1024/768
	colorDepth:24
	Browser inner width/height: 800/580

NOISE: __start_report158__end_report
NOISE: __startTimestamp1311926175303__endTimestamp
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report108__end_report
NOISE: __startTimestamp1311926356153__endTimestamp
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report142__end_report
NOISE: __startTimestamp1311926538125__endTimestamp
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report129__end_report
NOISE: __startTimestamp1311926720165__endTimestamp
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report115__end_report
NOISE: __startTimestamp1311926902359__endTimestamp
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report156__end_report
NOISE: __startTimestamp1311927084136__endTimestamp
reconnecting socket
FIRE PROC: ' "MOZ_CRASHREPORTER_SHUTDOWN=1,MOZ_CRASHREPORTER_NO_REPORT=1,NO_EM_RESTART=1" org.mozilla.fennec  -profile /mnt/sdcard/tests/profile http://bm-remote.build.mozilla.org/startup_test/fennecmark/fennecmark.html?test=PanDown%26webServer=bm-remote.build.mozilla.org'
NOISE: __start_report112__end_report
NOISE: __startTimestamp1311927266604__endTimestamp
NOISE: __start_report112__end_report
NOISE: __startTimestamp1311927266604__endTimestamp
Failed tpan: 
		Stopped Fri, 29 Jul 2011 08:28:42
FAIL: Busted: tpan
FAIL: timeout exceeded
Completed test tpan: 
		Stopped Fri, 29 Jul 2011 08:28:42
https://tbpl.mozilla.org/php/getParsedLog.php?id=6358458
Whiteboard: [mobile_unittests][orange] → [mobile_unittests][orange][android_tier_1]
Assignee: nobody → bear
Depends on: 690311
https://tbpl.mozilla.org/php/getParsedLog.php?id=7477463&tree=Mozilla-Aurora

Seems a bit odd that the tip of aurora hit this on four of five runs - I only get to stop retriggering because I accidentally got two on the last retriggering, and even though the last one hit this, I got a green one on the next-to-last.
https://tbpl.mozilla.org/php/getParsedLog.php?id=7482932&tree=Mozilla-Inbound

That would have made more sense as "four of five runs of twinopen on a single push." So far, I'm only up to three of three on my inbound push. So far.
https://tbpl.mozilla.org/php/getParsedLog.php?id=7572050&tree=Firefox
Android XUL Tegra 250 mozilla-central talos remote-ts on 2011-11-24 09:54:56 PST for push 84117219ded0
FAIL: Busted: ts
FAIL: timeout exceeded
https://tbpl.mozilla.org/php/getParsedLog.php?id=8130425&tree=Mozilla-Inbound

(don't worry, I'll only retrigger the newly-returned Android XUL tpan ten or twenty times before accepting that it's going to hit this every single time)
Sort of vaguely interesting that (ignoring the totally broken tpan) this has settled down to pretty much just being ts: if there's a red ts, it's going to be timeout exceeded, if there's a red anything else, it'll be something else.
removing from my queue so it can be triaged - I sat on this thinking that mozharness would allow us to change enough so this bug could be caught and retried without burning the test run
Assignee: bear → nobody
https://tbpl.mozilla.org/php/getParsedLog.php?id=11104147&tree=Firefox
DeviceManager: error pulling file '/mnt/sdcard/tests/browser_output.txt': No such file or directory
getting files in '/mnt/sdcard/tests/profile/minidumps/'
Failed tcheck2: 
		Stopped Sat, 21 Apr 2012 12:55:23
FAIL: Busted: tcheck2
FAIL: timeout exceeded

This was purple on tbpl. Should it have picked up the "timeout exceeded" on gone red like the other talos tests?
The nice thing about those timeout exceeded robocheck runs is that they are also bug 722166 since it doesn't ever start  up, and bug 681861 since it keeps reconnecting, and bug 711725 since the eventual outcome is losing the connection to the tegra, so you can star them as practically anything that happens to suit you at the time.

https://tbpl.mozilla.org/php/getParsedLog.php?id=11101918&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=11378228&tree=Firefox

FAIL: Busted: tcheck2
FAIL: timeout exceeded
Traceback (most recent call last):
  File "run_tests.py", line 737, in <module>
    main()
  File "run_tests.py", line 734, in main
    test_file(arg, options, parser.parsed)
  File "run_tests.py", line 675, in test_file
    raise e
utils.talosError: 'timeout exceeded'
reconnecting socket
FIRE PROC: 'am instrument -w -e class org.mozilla.fennec.tests.testCheck2 org.mozilla.roboexample.test/android.test.InstrumentationTestRunner'

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
========= Finished 'python run_tests.py ...' interrupted (results: 4, elapsed: 5 hrs, 56 mins, 56 secs) (at 2012-05-01 21:36:04.032772) =========

The not so nice part is that we don't actually have enough Tegra capacity to be able to get away with these 6 hour runs.
Depends on: 790602
Summary: Intermittent "timeout exceeded" in Android Tegra Talos → Intermittent Android "talosError: 'timeout exceeded'"
Depends on: 797324
Whiteboard: [mobile_unittests][orange][android_tier_1] → [mobile_unittests][android_tier_1]
It looks like all of the recent failures here have been on Pandas.

Looking at the last couple of logs, it looks to me like the tests run to completion fairly normally / quickly...and then the test times out about 10 minutes later. Are we just not detecting the test completion?

I wonder if this might be related to:

02-18 06:17:51.109 I/System  ( 1874): Failed to destroy process 2567
02-18 06:17:51.125 I/System  ( 1874): libcore.io.ErrnoException: kill failed: ESRCH (No such process)
02-18 06:17:51.125 I/System  ( 1874): 	at libcore.io.Posix.kill(Native Method)
02-18 06:17:51.125 I/System  ( 1874): 	at libcore.io.ForwardingOs.kill(ForwardingOs.java:77)
02-18 06:17:51.125 I/System  ( 1874): 	at java.lang.ProcessManager$ProcessImpl.destroy(ProcessManager.java:257)
02-18 06:17:51.125 I/System  ( 1874): 	at com.mozilla.SUTAgentAndroid.service.RedirOutputThread.run(RedirOutputThread.java:148)

Has anyone looked into this error?
Whiteboard: [mobile_unittests][android_tier_1] → [see comment 729][mobile_unittests][android_tier_1]
(In reply to TBPL Robot from comment #1069)
> edmorley
> https://tbpl.mozilla.org/php/getParsedLog.php?id=28557273&tree=Mozilla-
> Inbound
> Android 4.0 Panda mozilla-inbound talos remote-tcanvasmark on 2013-09-30
> 06:53:50
> revision: b4a1b2767689
> slave: panda-0732
> 
> Traceback (most recent call last):
> raise talosError("timeout exceeded")
> talosError: 'timeout exceeded'
> 09-30 07:02:35.195 E/GeckoConsole( 2076): [JavaScript Error: "TypeError:
> this.browser.stop is not a function" {file:
> "chrome://browser/content/browser.js" line: 2615}]
> 09-30 07:02:35.203 E/GeckoConsole( 2076): [JavaScript Error: "TypeError:
> aTab is null" {file: "chrome://browser/content/browser.js" line: 913}]
> 09-30 07:02:35.203 E/GeckoConsole( 2076): [JavaScript Error: "TypeError:
> this.browser.contentWindow is undefined" {file:
> "chrome://browser/content/browser.js" line: 2817}]
> Return code: 2
> Failed talos ['python', '/builds/panda-0732/talos-data/talos/run_tests.py',
> '--noisy', 'local.yml', '--datazilla-url',
> 'https://datazilla.mozilla.org/talos', '--authfile',
> '/builds/panda-0732/test/oauth.txt'] command run.
> Running post_fatal callback...
> Exiting -1

Joel, are these JS exceptions related to the timeout?
Flags: needinfo?(jmaher)
these are not related, they are coming from inside the browser and there is nothing in talos or the tests which reference this stuff.  I could be wrong.

the problem with tcanvasmark on android 4.0 (panda) is that we seem to consume all the resources and restart vital system resources including networking and the agent.  This means we need to reboot the device to reconnect which isn't a good thing.
Flags: needinfo?(jmaher)
David, why were the tracking flags changed here? :-)
Flags: needinfo?(dkl)
(Or more, is this a bug in the new tracking flags implementation?)
And we shall call him... bug 926118.
(In reply to Ed Morley [:edmorley UTC+1] from comment #1132)
> (Or more, is this a bug in the new tracking flags implementation?)

It was a bug yes that cause the flags to be cleared initially. I just added them back now that the fix is on production.

dkl
Flags: needinfo?(dkl)
Thank you :-)
Closing bugs where TBPLbot has previously commented, but have now not been modified for >3 months & do not contain the whiteboard strings for disabled/annotated tests or use the keyword leave-open. Filter on: mass-intermittent-bug-closure-2014-07
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.