Closed
Bug 793961
Opened 12 years ago
Closed 11 years ago
Investigation of disappearing devices
Categories
(Testing Graveyard :: Autophone, defect)
Testing Graveyard
Autophone
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mcote, Unassigned)
Details
Attachments
(2 files)
This is a tracking bug for the intermittent Autophone-device disconnections. There may well be several things going on here, but this bug can be a basis for investigation from which other bugs are spun off. Over the weekend, the staging autophone server lost the droid pro. A ping sent at 2012-09-23 11:18 failed to get a response. After all the standard timeouts and attempts to reconnect, the worker process died around 2012-09-23 12:20 EDT, and autophone disabled the phone. Last bit of worker stdout: RESULTS 98_4b_4a_13_01_84_droid_pro 2012-09-22 00:34:14:Start Time: 1354757 Thro bber Start: 1355822 Throbber Stop: 1359765 EndDraw: 0 Error receiving data from socket. cmd={'cmd': 'cd /mnt/sdcard/tests'}; err=[Errn o 54] Connection reset by peer - possible reboot Could not connect; sleeping for 5 seconds. reconnecting socket Error receiving data from socket. cmd={'cmd': 'cd /mnt/sdcard/tests'}; err=[Errn o 54] Connection reset by peer - possible reboot Could not connect; sleeping for 5 seconds. reconnecting socket Error receiving data from socket. cmd={'cmd': 'cd /mnt/sdcard/tests'}; err=[Errn o 65] No route to host Could not connect; sleeping for 5 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 10 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 15 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 20 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 25 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 30 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 35 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 5 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 10 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 15 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 20 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 25 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 30 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 35 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 5 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 10 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 15 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 20 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 25 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 30 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 35 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 5 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 10 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 15 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 20 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 25 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 30 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 35 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 5 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 10 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 15 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 20 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 25 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 30 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Could not connect; sleeping for 35 seconds. reconnecting socket unable to connect socket: [Errno 60] Operation timed out Process Process-2: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multipro cessing/process.py", line 258, in _bootstrap self.run() File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multipro cessing/process.py", line 114, in run self._target(*self._args, **self._kwargs) File "/Users/mozilla/autophone/src/autophone/worker.py", line 319, in loop self.recover_phone() File "/Users/mozilla/autophone/src/autophone/worker.py", line 203, in recover_ phone success = self.dm.reboot(self.ipaddr, 30000+self.worker_num) File "/Users/mozilla/autophone/src/autophone/worker.py", line 130, in dm retrylimit=8) File "/Users/mozilla/autophone/src/mozbase/mozdevice/mozdevice/devicemanagerSU T.py", line 55, in __init__ raise BaseException("Failed to connect to SUT Agent and retrieve the device root.") BaseException: Failed to connect to SUT Agent and retrieve the device root. 2012-09-23T12:20:31 Worker starting up.
Assignee | ||
Comment 1•12 years ago
|
||
Last bit of the worker log: 2012-09-23 10:48:37,322|INFO|Pinging phone 2012-09-23 10:48:37,495|INFO|Pong! 2012-09-23 11:03:37,547|INFO|Pinging phone 2012-09-23 11:03:37,690|INFO|Pong! 2012-09-23 11:18:37,741|INFO|Pinging phone 2012-09-23 11:55:56,011|INFO|No response! 2012-09-23 11:55:56,011|INFO|Ping unanswered. 2012-09-23 11:55:56,011|INFO|Rebooting phone... 2012-09-23 12:08:13,127|INFO|Phone did not reboot successfully. 2012-09-23 12:08:13,127|INFO|Rebooting phone... 2012-09-23 12:08:13,127|INFO|Connecting to 192.168.1.129:20701... 2012-09-23 12:20:31,935|INFO|Worker for phone 98_4b_4a_13_01_84_droid_pro starti ng up.
Assignee | ||
Comment 2•12 years ago
|
||
Status of prod server: 3/9 phones running: the two gs2_4.03s and _af_9a_nexus_s. All have reported consistent results aside from a gap from Sept 18-21 for the gs2s. nexus_one had reported results since Sept 14, but worker process died. Last result Sept 22. Droid pro, both samsung_gs2s, and the other two nexus_s reported no results since last major system update with new phones (Sept 12). nexus_one: Worker process had died. The exception was Process Process-5: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 231, in _bootstrap self.run() File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run self._target(*self._args, **self._kwargs) File "/Users/mozauto/autophone/worker.py", line 367, in loop t.runjob(job) File "/Users/mozauto/autophone/tests/s1s2test.py", line 55, in runjob starttime = self.dm.getInfo('uptimemillis')['uptimemillis'][0] KeyError: 'uptimemillis' The phone had completed a run about 10 hours before this occurred. Unfortunately the debug level was not turned up. All the pings were successful. It got a job at 2012-09-22 16:24:41 and reported a successful reboot, although the stdout was full of "Connection reset by peer - possible reboot" with 5-second sleeps. This is strange, since the timeout should have been going up. The number of retries was far more than the retrylimit as well. There are no errors in the log at this point. Finally a reboot succeeds somehow, and following takes place: Testfile push succeeded. Push of build.apk failed. Lots of reconnect attempts, finally failed - "Error installing app:" Profile installation continues (!) All timeouts trying to push, but pushdir continues. We then try to push s1s2 HTML files Finally exception when trying to info uptimemillis (KeyError, since we didn't get a response from info). Restarted in disabled mode. The SUT agent on the device was NOT up when I tried to connect to it. "Host is down" occurred when I tried to ping it. So it looks like this is a full device failure of some sort. One big question is why we continued to attempt the s1s2 test when it was clear from early on (during build installation) that the device was busted.
Assignee | ||
Comment 3•12 years ago
|
||
Also, the htc_sensation_4g never reported any results. _8d_samsung_gs2 is not registered at all (still disconnected, I think). However the only phones actually marked as disabled are the _e1_samsung_gs2, the nexus_one, the droid_pro, and the htc_sensation_4g. The 3 nexus_s and two gs2_4.03 are all reporting idle and appear to have run tests. However two of the nexus_s appear to not be launching fennec successfully.
Assignee | ||
Comment 4•12 years ago
|
||
Assignee: nobody → mcote
Status: NEW → ASSIGNED
Assignee | ||
Comment 5•12 years ago
|
||
Assignee | ||
Comment 6•12 years ago
|
||
The droid_pro and the htc_sensation_4g are failing to reboot. sensation: $ nc 10.250.120.247 20701 $>rebt not permited! $> droid_pro: $ nc 10.250.120.248 20701 $>info uptime 7 days 4 hours 15 minutes 8 seconds 485 ms $>rebt $>Permission denied but at this point, the connection hangs (agent frozen?). I can reconnect (though once it took a few tries to re-establish the connection), but the uptime indicates that the it hasn't rebooted: $ nc 10.250.120.248 20701 $>info uptime 7 days 4 hours 17 minutes 42 seconds 501 ms I'm guessing that the devices aren't rooted correctly.
Assignee | ||
Comment 7•12 years ago
|
||
The two nexus_ss that aren't reporting results are failing to launch fennec: 2012-09-14 18:38:01,469|DEBUG|running fennec 2012-09-14 18:38:02,004|DEBUG|Starting: Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' } Error: Activity not started, unable to resolve Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' flg=0x10000000 } ^@ We aren't detecting this, though, and just failing to find any of the events in the logcat, so we're sending all 0s to the server. There is no error reported when installing the build. The command looks different on the failing devices, compared to the nexus_s that is working. Working nexus_s (78_d6_f0_89_af_9a_nexus_s): 2012-09-14 18:38:53,946|DEBUG|Starting: Intent { act=android.intent.action.VIEW dat=file://mnt/sdcard/s1test/Twitter2.html cmp=org.mozilla.fennec/.App (has extras) } Failing nexus_s (78_d6_f0_cf_8d_11_nexus_s): Starting: Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' } Indeed, the command line on the gs2_4.03s matches the working nexus_s.
Assignee | ||
Comment 8•12 years ago
|
||
The gs2 (04_46_65_fd_2f_e1_samsung_gs2) is also failing to start fennec with exactly the same symptoms as above. I've turned up the debug level to 4 on all 3 phones to see what the exact command is. The gs2 also had occasional problems rebooting, which is what led to it being disabled. The stdout file has no record of the rebt command being sent, though, so it looks like the connection was dropped (agent crash?). Yet the log says that it connects but can't reboot. And for some reason a ping occurs after the phone is disabled, immediately after the notification was sent. Log: 2012-09-21 10:28:32,318|INFO|RESULTS: 04_46_65_fd_2f_e1_samsung_gs2:Start Time: 2134989 Throbber Start: 0 Throbber Stop: 0 EndDraw: 0 2012-09-21 10:28:32,566|INFO|Job completed. 2012-09-21 10:28:32,566|INFO|Got job. 2012-09-21 10:28:32,567|INFO|Rebooting... 2012-09-21 10:28:32,567|INFO|Rebooting phone... 2012-09-21 10:28:32,692|INFO|Phone did not reboot successfully. 2012-09-21 10:28:32,692|INFO|Rebooting phone... 2012-09-21 10:28:32,692|INFO|Connecting to 10.250.120.67:20701... 2012-09-21 10:28:32,899|INFO|Connected. 2012-09-21 10:28:32,938|INFO|Phone did not reboot successfully. 2012-09-21 10:28:32,939|INFO|Rebooting phone... 2012-09-21 10:28:32,939|INFO|Connecting to 10.250.120.67:20701... 2012-09-21 10:28:33,048|INFO|Connected. 2012-09-21 10:28:33,099|INFO|Phone did not reboot successfully. 2012-09-21 10:28:33,100|INFO|Phone has been rebooted 3 times; giving up. 2012-09-21 10:28:33,100|INFO|Disabling phone: Phone was rebooted 3 times.. 2012-09-21 10:28:33,100|INFO|Sending notification... 2012-09-21 10:28:33,405|INFO|Sent. 2012-09-21 10:28:33,406|INFO|Phone is disabled; queuing job for later. 2012-09-21 10:28:33,406|INFO|Got job. 2012-09-21 10:28:33,406|INFO|Phone is disabled; queuing job for later. 2012-09-21 10:28:33,407|INFO|Pinging phone 2012-09-21 10:28:33,407|INFO|Connecting to 10.250.120.67:20701... 2012-09-21 10:28:33,631|INFO|Connected. 2012-09-21 10:28:33,694|INFO|Pong! 2012-09-21 12:35:23,768|INFO|Got job. 2012-09-21 12:35:23,768|INFO|Phone is disabled; queuing job for later. out: RESULTS 04_46_65_fd_2f_e1_samsung_gs2 2012-09-21 03:06:01:Start Time: 2134989 Throbber Start: 0 Throbber Stop: 0 EndDraw: 0 reconnecting socket reconnecting socket reconnecting socket sent cmd: testroot recv'ing... response: /mnt/sdcard $>^@ sent cmd: cd /mnt/sdcard/tests recv'ing... response: $>^@ sent cmd: cwd recv'ing... response: /mnt/sdcard/tests $>^@ sent cmd: ver recv'ing... response: SUTAgentAndroid Version 1.04 $>^@ sent cmd: cd /mnt/sdcard/tests recv'ing... response: $>^@ sent cmd: cwd recv'ing... response: /mnt/sdcard/tests $>^@
Assignee | ||
Comment 9•12 years ago
|
||
The command lines used to start fennec appear to be the same on both the working and failing nexus_s: sent cmd: exec am start -a android.intent.action.VIEW -n org.mozilla.fennec/.App --es args '--profile /mnt/sdcard/tests/autophone/profile' -d http://10.250.50.1 62/Twitter2.html recv'ing... response: Starting: Intent { act=android.intent.action.VIEW dat=http://10.250.50 .162/Twitter2.html cmp=org.mozilla.fennec/.App (has extras) } recv'ing... response: return code [0] sent cmd: exec am start -a android.intent.action.VIEW -n org.mozilla.fennec/.App --es args '--profile /mnt/sdcard/tests/autophone/profile' -d http://10.250.50.1 62/Twitter2.html recv'ing... response: Starting: Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/test s/autophone/profile' } recv'ing... response: Error: Activity not started, unable to resolve Intent { act=android.in tent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' flg=0x10000000 } recv'ing... response: return code [0] The user.js and prefs.js files are identical aside from the comments.
Assignee | ||
Comment 10•12 years ago
|
||
Easy answer to the fennec-starting problems: those devices had an old SUTAgent (1.04) with known problems in 'exec'. Upgrading them appears to have fixed the problem.
Assignee | ||
Comment 11•11 years ago
|
||
We've made a lot of progress on this front, and we have the phones in front of us, so resolving this.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•2 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•