Investigation of disappearing devices

RESOLVED FIXED

Status

Testing
Autophone
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: mcote, Assigned: mcote)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

5 years ago
This is a tracking bug for the intermittent Autophone-device disconnections. There may well be several things going on here, but this bug can be a basis for investigation from which other bugs are spun off.

Over the weekend, the staging autophone server lost the droid pro. A ping sent at 2012-09-23 11:18 failed to get a response. After all the standard timeouts and attempts to reconnect, the worker process died around 2012-09-23 12:20 EDT, and autophone disabled the phone.

Last bit of worker stdout:


RESULTS 98_4b_4a_13_01_84_droid_pro 2012-09-22 00:34:14:Start Time: 1354757 Thro
bber Start: 1355822 Throbber Stop: 1359765 EndDraw: 0
Error receiving data from socket. cmd={'cmd': 'cd /mnt/sdcard/tests'}; err=[Errn
o 54] Connection reset by peer - possible reboot
Could not connect; sleeping for 5 seconds.
reconnecting socket
Error receiving data from socket. cmd={'cmd': 'cd /mnt/sdcard/tests'}; err=[Errn
o 54] Connection reset by peer - possible reboot
Could not connect; sleeping for 5 seconds.
reconnecting socket
Error receiving data from socket. cmd={'cmd': 'cd /mnt/sdcard/tests'}; err=[Errn
o 65] No route to host
Could not connect; sleeping for 5 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 10 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 15 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 20 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 25 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 30 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 35 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 5 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 10 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 15 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 20 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 25 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 30 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 35 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 5 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 10 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 15 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 20 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 25 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 30 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 35 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 5 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 10 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 15 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 20 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 25 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 30 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 35 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 5 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 10 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 15 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 20 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 25 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 30 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Could not connect; sleeping for 35 seconds.
reconnecting socket
unable to connect socket: [Errno 60] Operation timed out
Process Process-2:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multipro
cessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multipro
cessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mozilla/autophone/src/autophone/worker.py", line 319, in loop
    self.recover_phone()
  File "/Users/mozilla/autophone/src/autophone/worker.py", line 203, in recover_
phone
    success = self.dm.reboot(self.ipaddr, 30000+self.worker_num)
  File "/Users/mozilla/autophone/src/autophone/worker.py", line 130, in dm
    retrylimit=8)
  File "/Users/mozilla/autophone/src/mozbase/mozdevice/mozdevice/devicemanagerSU
T.py", line 55, in __init__
    raise BaseException("Failed to connect to SUT Agent and retrieve the device 
root.")
BaseException: Failed to connect to SUT Agent and retrieve the device root.
2012-09-23T12:20:31 Worker starting up.
(Assignee)

Comment 1

5 years ago
Last bit of the worker log:

2012-09-23 10:48:37,322|INFO|Pinging phone
2012-09-23 10:48:37,495|INFO|Pong!
2012-09-23 11:03:37,547|INFO|Pinging phone
2012-09-23 11:03:37,690|INFO|Pong!
2012-09-23 11:18:37,741|INFO|Pinging phone
2012-09-23 11:55:56,011|INFO|No response!
2012-09-23 11:55:56,011|INFO|Ping unanswered.
2012-09-23 11:55:56,011|INFO|Rebooting phone...
2012-09-23 12:08:13,127|INFO|Phone did not reboot successfully.
2012-09-23 12:08:13,127|INFO|Rebooting phone...
2012-09-23 12:08:13,127|INFO|Connecting to 192.168.1.129:20701...
2012-09-23 12:20:31,935|INFO|Worker for phone 98_4b_4a_13_01_84_droid_pro starti
ng up.
(Assignee)

Comment 2

5 years ago
Status of prod server:

3/9 phones running: the two gs2_4.03s and _af_9a_nexus_s. All have reported consistent results aside from a gap from Sept 18-21 for the gs2s.

nexus_one had reported results since Sept 14, but worker process died. Last result Sept 22.
Droid pro, both samsung_gs2s, and the other two nexus_s reported no results since last major system update with new phones (Sept 12).

nexus_one:

Worker process had died. The exception was

Process Process-5:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 231, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mozauto/autophone/worker.py", line 367, in loop
    t.runjob(job)
  File "/Users/mozauto/autophone/tests/s1s2test.py", line 55, in runjob
    starttime = self.dm.getInfo('uptimemillis')['uptimemillis'][0]
KeyError: 'uptimemillis'

The phone had completed a run about 10 hours before this occurred. Unfortunately the debug level was not turned up. All the pings were successful. It got a job at 2012-09-22 16:24:41 and reported a successful reboot, although the stdout was full of "Connection reset by peer - possible reboot" with 5-second sleeps. This is strange, since the timeout should have been going up. The number of retries was far more than the retrylimit as well. There are no errors in the log at this point.

Finally a reboot succeeds somehow, and following takes place:

Testfile push succeeded.
Push of build.apk failed.
Lots of reconnect attempts, finally failed - "Error installing app:"
Profile installation continues (!)
All timeouts trying to push, but pushdir continues.
We then try to push s1s2 HTML files
Finally exception when trying to info uptimemillis (KeyError, since we didn't get a response from info).
Restarted in disabled mode.

The SUT agent on the device was NOT up when I tried to connect to it. "Host is down" occurred when I tried to ping it.

So it looks like this is a full device failure of some sort. One big question is why we continued to attempt the s1s2 test when it was clear from early on (during build installation) that the device was busted.
(Assignee)

Comment 3

5 years ago
Also, the htc_sensation_4g never reported any results. _8d_samsung_gs2 is not registered at all (still disconnected, I think).

However the only phones actually marked as disabled are the _e1_samsung_gs2, the nexus_one, the droid_pro, and the htc_sensation_4g. The 3 nexus_s and two gs2_4.03 are all reporting idle and appear to have run tests. However two of the nexus_s appear to not be launching fennec successfully.
(Assignee)

Comment 4

5 years ago
Created attachment 665208 [details]
nexus_one log starting at the end of the test before failure
Assignee: nobody → mcote
Status: NEW → ASSIGNED
(Assignee)

Comment 5

5 years ago
Created attachment 665209 [details]
nexus_one stdout starting at the end of the test before failure
(Assignee)

Comment 6

5 years ago
The droid_pro and the htc_sensation_4g are failing to reboot.

sensation:

$ nc 10.250.120.247 20701
$>rebt
not permited!
$>

droid_pro:

$ nc 10.250.120.248 20701
$>info uptime
7 days 4 hours 15 minutes 8 seconds 485 ms
$>rebt
$>Permission denied

but at this point, the connection hangs (agent frozen?). I can reconnect (though once it took a few tries to re-establish the connection), but the uptime indicates that the it hasn't rebooted:

$ nc 10.250.120.248 20701
$>info uptime
7 days 4 hours 17 minutes 42 seconds 501 ms

I'm guessing that the devices aren't rooted correctly.
(Assignee)

Comment 7

5 years ago
The two nexus_ss that aren't reporting results are failing to launch fennec:

2012-09-14 18:38:01,469|DEBUG|running fennec
2012-09-14 18:38:02,004|DEBUG|Starting: Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' }
Error: Activity not started, unable to resolve Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' flg=0x10000000 }
^@

We aren't detecting this, though, and just failing to find any of the events in the logcat, so we're sending all 0s to the server. There is no error reported when installing the build.

The command looks different on the failing devices, compared to the nexus_s that is working.

Working nexus_s (78_d6_f0_89_af_9a_nexus_s):

2012-09-14 18:38:53,946|DEBUG|Starting: Intent { act=android.intent.action.VIEW dat=file://mnt/sdcard/s1test/Twitter2.html cmp=org.mozilla.fennec/.App (has extras) }

Failing nexus_s (78_d6_f0_cf_8d_11_nexus_s):

Starting: Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' }

Indeed, the command line on the gs2_4.03s matches the working nexus_s.
(Assignee)

Comment 8

5 years ago
The gs2 (04_46_65_fd_2f_e1_samsung_gs2) is also failing to start fennec with exactly the same symptoms as above. I've turned up the debug level to 4 on all 3 phones to see what the exact command is.

The gs2 also had occasional problems rebooting, which is what led to it being disabled. The stdout file has no record of the rebt command being sent, though, so it looks like the connection was dropped (agent crash?). Yet the log says that it connects but can't reboot. And for some reason a ping occurs after the phone is disabled, immediately after the notification was sent.

Log:

2012-09-21 10:28:32,318|INFO|RESULTS: 04_46_65_fd_2f_e1_samsung_gs2:Start Time: 
2134989 Throbber Start: 0 Throbber Stop: 0 EndDraw: 0
2012-09-21 10:28:32,566|INFO|Job completed.
2012-09-21 10:28:32,566|INFO|Got job.
2012-09-21 10:28:32,567|INFO|Rebooting...
2012-09-21 10:28:32,567|INFO|Rebooting phone...
2012-09-21 10:28:32,692|INFO|Phone did not reboot successfully.
2012-09-21 10:28:32,692|INFO|Rebooting phone...
2012-09-21 10:28:32,692|INFO|Connecting to 10.250.120.67:20701...
2012-09-21 10:28:32,899|INFO|Connected.
2012-09-21 10:28:32,938|INFO|Phone did not reboot successfully.
2012-09-21 10:28:32,939|INFO|Rebooting phone...
2012-09-21 10:28:32,939|INFO|Connecting to 10.250.120.67:20701...
2012-09-21 10:28:33,048|INFO|Connected.
2012-09-21 10:28:33,099|INFO|Phone did not reboot successfully.
2012-09-21 10:28:33,100|INFO|Phone has been rebooted 3 times; giving up.
2012-09-21 10:28:33,100|INFO|Disabling phone: Phone was rebooted 3 times..
2012-09-21 10:28:33,100|INFO|Sending notification...
2012-09-21 10:28:33,405|INFO|Sent.
2012-09-21 10:28:33,406|INFO|Phone is disabled; queuing job for later.
2012-09-21 10:28:33,406|INFO|Got job.
2012-09-21 10:28:33,406|INFO|Phone is disabled; queuing job for later.
2012-09-21 10:28:33,407|INFO|Pinging phone
2012-09-21 10:28:33,407|INFO|Connecting to 10.250.120.67:20701...
2012-09-21 10:28:33,631|INFO|Connected.
2012-09-21 10:28:33,694|INFO|Pong!
2012-09-21 12:35:23,768|INFO|Got job.
2012-09-21 12:35:23,768|INFO|Phone is disabled; queuing job for later.

out:

RESULTS 04_46_65_fd_2f_e1_samsung_gs2 2012-09-21 03:06:01:Start Time: 2134989 Throbber Start: 0 Throbber Stop: 0 EndDraw: 0
reconnecting socket
reconnecting socket
reconnecting socket
sent cmd: testroot
recv'ing...
response: /mnt/sdcard
$>^@
sent cmd: cd /mnt/sdcard/tests
recv'ing...
response: $>^@
sent cmd: cwd
recv'ing...
response: /mnt/sdcard/tests
$>^@
sent cmd: ver
recv'ing...
response: SUTAgentAndroid Version 1.04
$>^@
sent cmd: cd /mnt/sdcard/tests
recv'ing...
response: $>^@
sent cmd: cwd
recv'ing...
response: /mnt/sdcard/tests
$>^@
(Assignee)

Comment 9

5 years ago
The command lines used to start fennec appear to be the same on both the working and failing nexus_s:

sent cmd: exec am start -a android.intent.action.VIEW -n org.mozilla.fennec/.App
 --es args '--profile /mnt/sdcard/tests/autophone/profile' -d http://10.250.50.1
62/Twitter2.html
recv'ing...
response: Starting: Intent { act=android.intent.action.VIEW dat=http://10.250.50
.162/Twitter2.html cmp=org.mozilla.fennec/.App (has extras) }

recv'ing...
response: return code [0]


sent cmd: exec am start -a android.intent.action.VIEW -n org.mozilla.fennec/.App
 --es args '--profile /mnt/sdcard/tests/autophone/profile' -d http://10.250.50.1
62/Twitter2.html
recv'ing...
response: Starting: Intent { act=android.intent.action.VIEW dat=/mnt/sdcard/test
s/autophone/profile' }

recv'ing...
response: Error: Activity not started, unable to resolve Intent { act=android.in
tent.action.VIEW dat=/mnt/sdcard/tests/autophone/profile' flg=0x10000000 }

recv'ing...
response: return code [0]

The user.js and prefs.js files are identical aside from the comments.
(Assignee)

Comment 10

5 years ago
Easy answer to the fennec-starting problems: those devices had an old SUTAgent (1.04) with known problems in 'exec'. Upgrading them appears to have fixed the problem.
(Assignee)

Comment 11

5 years ago
We've made a lot of progress on this front, and we have the phones in front of us, so resolving this.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.