Closed Bug 752222 Opened 14 years ago Closed 10 years ago

trobo hangs occasionally...

Categories

(Testing :: Talos, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Callek, Unassigned)

References

Details

(Whiteboard: talos-android)

So, I have very little info on this problem, but I did speak with Joel a few times this past week on the general notice of it.

For example, today we had one run of this hanging (tegra-089) for almost 5 hours.

The log from the actual run is at https://tbpl.mozilla.org/php/getParsedLog.php?id=11498545&tree=Firefox however the relevant part imo is:

----
Failed tprovider:
Stopped Sat, 05 May 2012 03:31:38
Traceback (most recent call last):
  File "run_tests.py", line 737, in <module>
FAIL: Busted: tprovider
FAIL: timeout exceeded
    main()
  File "run_tests.py", line 734, in main
    test_file(arg, options, parser.parsed)
  File "run_tests.py", line 675, in test_file
    raise e
utils.talosError: 'timeout exceeded'
reconnecting socket
FIRE PROC: 'am instrument -w -e class org.mozilla.fennec.tests.testBrowserProviderPerf org.mozilla.roboexample.test/android.test.InstrumentationTestRunner'
----

And then the hang.

The other interesting part is that doing a manual kill_stalled.sh on the foopy yielded no hung procs, but checking ps output had 3 bcontroller.py's (of varying ages):

cltbld 53869 0.0 0.2 2456736 10464 ?? S 3:19AM 0:00.38 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /builds/tegra-089/talos-data/talos/bcontroller.py --configFile /builds/tegra-089/talos-data/talos/bcontroller.yml
cltbld 21733 0.0 0.8 2477472 35220 ?? S 9:39AM 1:33.75 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd --no_save --rundir=/builds/tegra-089 --pidfile=/builds/tegra-089/twistd.pid --python=/builds/tegra-089/buildbot.tac
cltbld 56044 0.0 0.2 2456736 10460 ?? S Mon05AM 0:00.38 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /builds/tegra-089/talos-data/talos/bcontroller.py --configFile /builds/tegra-089/talos-data/talos/bcontroller.yml
cltbld 11718 0.0 0.2 2456736 10460 ?? S 26Apr12 0:00.37 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /builds/tegra-089/talos-data/talos/bcontroller.py --configFile /builds/tegra-089/talos-data/talos/bcontroller.yml
cltbld 99725 0.0 0.1 2446768 2820 ?? S 23Apr12 1:28.96 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python clientproxy.py -b --tegra=tegra-089
cltbld 99724 0.0 0.1 2456764 3716 ?? S 23Apr12 0:21.06 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python clientproxy.py -b --tegra=tegra-089
---------

So *something* is not letting bcontroller.py exit properly it seems, and that is likely interfering with this test, aiui, and possibly other tests!?

Joel, can you help get someone on this issue? Feel free to poke me for assistance in digging into it.
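For anyone repeating the manual ps check on a foopy, here is a minimal sketch of how the lingering bcontroller.py processes could be picked out of that output. It is not part of Talos or kill_stalled.sh; the function name `find_lingering` is made up for illustration, and it assumes the eleven-column `ps aux` layout seen in the paste above (USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND).

```python
# Sketch only: assumes the ps column layout from the paste in this bug
# (USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND).
# find_lingering is a hypothetical helper, not an existing Talos function.

def find_lingering(ps_text, script="bcontroller.py"):
    """Return (pid, started, command) for each ps line whose command mentions `script`."""
    matches = []
    for line in ps_text.splitlines():
        # Ten splits give eleven fields; the command keeps its own spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header, separator, or truncated line
        if script in fields[10]:
            matches.append((int(fields[1]), fields[8], fields[10]))
    return matches


# Four of the lines pasted in this bug, unchanged.
PYTHON = ("/opt/local/Library/Frameworks/Python.framework/Versions/2.6/"
          "Resources/Python.app/Contents/MacOS/Python")
BCONTROLLER = ("/builds/tegra-089/talos-data/talos/bcontroller.py --configFile "
               "/builds/tegra-089/talos-data/talos/bcontroller.yml")
SAMPLE = "\n".join([
    "cltbld 53869 0.0 0.2 2456736 10464 ?? S 3:19AM 0:00.38 %s %s" % (PYTHON, BCONTROLLER),
    "cltbld 56044 0.0 0.2 2456736 10460 ?? S Mon05AM 0:00.38 %s %s" % (PYTHON, BCONTROLLER),
    "cltbld 11718 0.0 0.2 2456736 10460 ?? S 26Apr12 0:00.37 %s %s" % (PYTHON, BCONTROLLER),
    "cltbld 99725 0.0 0.1 2446768 2820 ?? S 23Apr12 1:28.96 %s clientproxy.py -b --tegra=tegra-089" % PYTHON,
])

for pid, started, _command in find_lingering(SAMPLE):
    print(pid, started)
```

The STARTED column is left as a string because ps prints it in several formats (3:19AM, Mon05AM, 26Apr12), so anything that needs a real age would have to get it some other way.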
Rather quick, as these go:

https://tbpl.mozilla.org/php/getParsedLog.php?id=11502632&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound talos remote-trobocheck on 2012-05-05 10:10:25 PDT for push 07f84eae606e

FAIL: timeout exceeded
reconnecting socket
FIRE PROC: 'am instrument -w -e class org.mozilla.fennec.tests.testCheck org.mozilla.roboexample.test/android.test.InstrumentationTestRunner'

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
========= Finished 'python run_tests.py ...' interrupted (results: 4, elapsed: 2 hrs, 2 mins, 42 secs) (at 2012-05-05 12:27:22.844314) =========
https://tbpl.mozilla.org/php/getParsedLog.php?id=11516136&tree=Mozilla-Inbound Android Tegra 250 mozilla-inbound talos remote-troboprovider on 2012-05-06 05:36:24 PDT for push 4ba9cc4ee095 elapsed: 10 hrs, 25 mins, 2 secs
https://tbpl.mozilla.org/php/getParsedLog.php?id=11516604&tree=Mozilla-Inbound Android Tegra 250 mozilla-inbound talos remote-trobocheck2 on 2012-05-06 05:36:34 PDT for push 4ba9cc4ee095 elapsed: 10 hrs, 58 mins, 44 secs
https://tbpl.mozilla.org/php/getParsedLog.php?id=11516713&tree=Mozilla-Inbound Android Tegra 250 mozilla-inbound talos remote-trobocheck on 2012-05-06 05:36:34 PDT for push 4ba9cc4ee095 elapsed: 10 hrs, 58 mins, 32 secs
https://tbpl.mozilla.org/php/getParsedLog.php?id=11528141&tree=Mozilla-Inbound Android Tegra 250 mozilla-inbound talos remote-trobocheck2 on 2012-05-06 23:09:44 PDT for push 929610b0c428 elapsed: 3 hrs, 41 mins, 0 secs

https://tbpl.mozilla.org/php/getParsedLog.php?id=11528136&tree=Mozilla-Inbound Android Tegra 250 mozilla-inbound talos remote-trobocheck2 on 2012-05-06 22:35:52 PDT for push 33168c4c4703 elapsed: 4 hrs, 24 mins, 51 secs
And since I don't often go back two days to bring you news of the truly awful ones, right now 5.88% of our working Tegra pool is hung doing trobo* jobs, ranging from 2 hours 44 minutes in to 1 day 20 hours and 40 minutes in.
callek has just been "read into" the foopy cabal, so we will be working tomorrow morning on a way of detecting these as stalled jobs and removing them. This should allow the tests to remain and have them show up as oranges like they should.
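As a rough illustration of the detection half of that plan, the elapsed strings reported in the comments above can be converted to seconds and compared against a cutoff. This is only a sketch, not the detection that was actually deployed: the names `elapsed_seconds` and `stalled` and the two-hour cutoff are assumptions, and the third job in the example is hypothetical.

```python
import re

# Sketch only: the helper names and the two-hour cutoff are assumptions,
# not the detection that was actually put on the foopies.
UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def elapsed_seconds(text):
    """Convert a string such as '10 hrs, 25 mins, 2 secs' to seconds."""
    total = 0
    # Matching 'hr', 'min', 'sec' also covers the plural forms 'hrs', 'mins', 'secs'.
    for amount, unit in re.findall(r"(\d+)\s*(hr|min|sec)", text):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


def stalled(jobs, limit_seconds=2 * 3600):
    """Return the names of jobs whose elapsed time exceeds limit_seconds."""
    return [name for name, elapsed in jobs if elapsed_seconds(elapsed) > limit_seconds]


jobs = [
    ("remote-troboprovider", "10 hrs, 25 mins, 2 secs"),  # from this bug
    ("remote-trobocheck2", "3 hrs, 41 mins, 0 secs"),     # from this bug
    ("hypothetical-short-job", "12 mins, 30 secs"),       # made up for contrast
]
print(stalled(jobs))
```

Day-long hangs would need a "days" unit added to the table; the strings quoted in this bug only use hours, minutes, and seconds.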
Depends on: 752966
Is there possibly also a problem in the robocop tests themselves, in addition to the bcontroller issue? The log in Comment 1 suggests to me that the test was launched but never ended.
yeah, there is a chance of that. I have seen it once or twice locally. It seems that the test starts fine and we get through 1 or more iterations, but then it dies. When it dies it looks like it fails to connect to the device as the primary cause (which could be a side effect of foopies, etc...)
Whiteboard: talos-android
We are moving the remaining Android Talos tests to Autophone this quarter. Autophone is more robust in device management and retrying, so most likely we will not see this issue there.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX