Closed
Bug 752222
Opened 14 years ago
Closed 10 years ago
trobo hangs occasionally...
Categories
(Testing :: Talos, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: Callek, Unassigned)
References
Details
(Whiteboard: talos-android)
So, I have very little info on this problem, but I did speak with Joel a few times this past week about it in general terms.
For example, today we had one run of this hanging (tegra-089) for almost 5 hours.
The log from the actual run is at https://tbpl.mozilla.org/php/getParsedLog.php?id=11498545&tree=Firefox; however, the relevant part, imo, is:
----
Failed tprovider:
Stopped Sat, 05 May 2012 03:31:38
Traceback (most recent call last):
File "run_tests.py", line 737, in <module>
FAIL: Busted: tprovider
FAIL: timeout exceeded
main()
File "run_tests.py", line 734, in main
test_file(arg, options, parser.parsed)
File "run_tests.py", line 675, in test_file
raise e
utils.talosError: 'timeout exceeded'
reconnecting socket
FIRE PROC: 'am instrument -w -e class org.mozilla.fennec.tests.testBrowserProviderPerf org.mozilla.roboexample.test/android.test.InstrumentationTestRunner'
----
And then the hang. Another interesting detail: running kill_stalled.sh manually on the foopy yielded no hung procs, but the ps output showed 3 bcontroller.py processes (of varying ages):
cltbld 53869 0.0 0.2 2456736 10464 ?? S 3:19AM 0:00.38 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /builds/tegra-089/talos-data/talos/bcontroller.py --configFile /builds/tegra-089/talos-data/talos/bcontroller.yml
cltbld 21733 0.0 0.8 2477472 35220 ?? S 9:39AM 1:33.75 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd --no_save --rundir=/builds/tegra-089 --pidfile=/builds/tegra-089/twistd.pid --python=/builds/tegra-089/buildbot.tac
cltbld 56044 0.0 0.2 2456736 10460 ?? S Mon05AM 0:00.38 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /builds/tegra-089/talos-data/talos/bcontroller.py --configFile /builds/tegra-089/talos-data/talos/bcontroller.yml
cltbld 11718 0.0 0.2 2456736 10460 ?? S 26Apr12 0:00.37 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /builds/tegra-089/talos-data/talos/bcontroller.py --configFile /builds/tegra-089/talos-data/talos/bcontroller.yml
cltbld 99725 0.0 0.1 2446768 2820 ?? S 23Apr12 1:28.96 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python clientproxy.py -b --tegra=tegra-089
cltbld 99724 0.0 0.1 2456764 3716 ?? S 23Apr12 0:21.06 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python clientproxy.py -b --tegra=tegra-089
---------
So *something* is not letting bcontroller.py exit properly it seems, and that is likely interfering with this test, aiui, and possibly other tests!?
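A minimal sketch of how the leftover bcontroller.py processes could be picked out of `ps` output like the listing above. The function name `find_bcontrollers` is hypothetical, and the sketch assumes the 11-column layout shown in the listing (user, pid, %cpu, %mem, vsz, rss, tty, stat, started, time, command); it is not part of talos or of the foopy tooling.

```python
def find_bcontrollers(ps_output, tegra):
    """Return (pid, started) pairs for bcontroller.py processes of one Tegra.

    Assumes the 11-column ps layout from the listing in this bug; the
    command column is kept whole because it contains spaces.
    """
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        command = fields[10]
        # Only count bcontroller.py instances living under this Tegra's build dir.
        if "bcontroller.py" in command and "/builds/%s/" % tegra in command:
            matches.append((int(fields[1]), fields[8]))
    return matches
```

More than one entry for the same Tegra would suggest that an earlier bcontroller.py never exited.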
Joel, can you help get someone on this issue? Feel free to poke me for assistance in digging into it.
Comment 1•14 years ago
Rather quick, as these go:
https://tbpl.mozilla.org/php/getParsedLog.php?id=11502632&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobocheck on 2012-05-05 10:10:25 PDT for push 07f84eae606e
FAIL: timeout exceeded
reconnecting socket
FIRE PROC: 'am instrument -w -e class org.mozilla.fennec.tests.testCheck org.mozilla.roboexample.test/android.test.InstrumentationTestRunner'
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
========= Finished 'python run_tests.py ...' interrupted (results: 4, elapsed: 2 hrs, 2 mins, 42 secs) (at 2012-05-05 12:27:22.844314) =========
Comment 2•14 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=11516136&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-troboprovider on 2012-05-06 05:36:24 PDT for push 4ba9cc4ee095
elapsed: 10 hrs, 25 mins, 2 secs
Comment 3•14 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=11516604&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobocheck2 on 2012-05-06 05:36:34 PDT for push 4ba9cc4ee095
elapsed: 10 hrs, 58 mins, 44 secs
Comment 4•14 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=11516713&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobocheck on 2012-05-06 05:36:34 PDT for push 4ba9cc4ee095
elapsed: 10 hrs, 58 mins, 32 secs
Comment 5•14 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=11528141&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobocheck2 on 2012-05-06 23:09:44 PDT for push 929610b0c428
elapsed: 3 hrs, 41 mins, 0 secs
https://tbpl.mozilla.org/php/getParsedLog.php?id=11528136&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobocheck2 on 2012-05-06 22:35:52 PDT for push 33168c4c4703
elapsed: 4 hrs, 24 mins, 51 secs
Comment 6•14 years ago
And since I don't often go back two days to bring you news of the truly awful ones, right now 5.88% of our working Tegra pool is hung doing trobo* jobs, ranging from 2 hours 44 minutes in to 1 day 20 hours and 40 minutes in.
Comment 7•14 years ago
callek has just been "read into" the foopy cabal, so we will be working tomorrow morning on a way of detecting these as stalled jobs and removing them.
This should allow the tests to remain and have them show up as oranges like they should.
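One way to detect and remove a stalled job is a hard deadline around the child process. The following is only a sketch under that assumption: `run_with_deadline` is a hypothetical helper, not something that exists in talos or buildbot, and it polls rather than using a timeout argument so that it also works on the Python 2.6 seen in the ps listing.

```python
import subprocess
import time


def run_with_deadline(cmd, deadline_secs, poll_secs=0.1):
    """Run cmd; return its exit code, or None if it had to be killed.

    The child is polled until deadline_secs have elapsed. If it is still
    running at that point it is treated as stalled and killed.
    """
    proc = subprocess.Popen(cmd)
    end = time.time() + deadline_secs
    while time.time() < end:
        ret = proc.poll()
        if ret is not None:
            return ret  # finished on its own
        time.sleep(poll_secs)
    if proc.poll() is None:
        proc.kill()  # stalled: remove it so the job can be reported as failed
        proc.wait()  # reap the child so it does not linger
        return None
    return proc.returncode
```

A caller could treat a `None` result as a failed (orange) run instead of letting the job sit for hours.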
Comment 8•14 years ago
Is there possibly also a problem in the robocop tests themselves, in addition to the bcontroller issue? The log in Comment 1 suggests to me that the test was launched but never ended.
Comment 9•14 years ago
Yeah, there is a chance of that. I have seen it once or twice locally. It seems that the test starts fine and we get through 1 or more iterations, but then it dies. When it dies, the primary cause looks like a failure to connect to the device (which could be a side effect of the foopies, etc.).
Updated•10 years ago
Whiteboard: talos-android
Comment 10•10 years ago
We are moving the remaining Android talos tests to autophone this quarter. Autophone is more robust in device management and retrying, so most likely we will not see this issue there.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX