Closed Bug 822321 (Opened 13 years ago, Closed 12 years ago)

Intermittent Panda "Could not connect; sleeping for 5 seconds. reconnecting socket" in runtestsremote.py, then "Remote Device Error: unable to connect socket: [Errno 111] Connection refused" during reboot device step

Categories

(Testing :: General, defect)

Hardware: ARM
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: emorley, Unassigned)

References

Details

(Keywords: intermittent-failure)

Android 4.0 Panda mozilla-central opt test mochitest-8 on 2012-12-17 06:52:50 PST for push ba26dc1c6267
slave: panda-0856
https://tbpl.mozilla.org/php/getParsedLog.php?id=18016450&tree=Firefox

{
12/17/2012 07:00:22: INFO: forcing device panda-0856 reboot
12/17/2012 07:00:22: INFO: Calling PDU powercycle for panda-0856, panda-relay-077.p9.releng.scl1.mozilla.com:2:1
12/17/2012 07:00:23: INFO: Waiting for device to come back...
12/17/2012 07:01:53: INFO: Try 1
reconnecting socket
Remote Device Error: unable to connect socket: [Errno 111] Connection refused
Could not connect; sleeping for 5 seconds.
reconnecting socket
sent cmd: testroot
recv'ing...
response: /mnt/sdcard
$>
sent cmd: cd /mnt/sdcard/tests
recv'ing...
response:
$>
sent cmd: cwd
recv'ing...
response: /mnt/sdcard/tests
$>
12/17/2012 07:01:58: INFO: devroot /mnt/sdcard/tests
12/17/2012 07:01:58: INFO: True
program finished with exit code 0
elapsedTime=96.456906
========= Finished Reboot Device (results: 0, elapsed: 1 mins, 36 secs) (at 2012-12-17 07:01:58.864448) =========
}

Similar to bug 820851:
(a) We're not getting a TBPL parsable error at the point of the initial failure (the runtestsremote.py step)
(b) We're not marking this as a buildbot RETRY, when we probably should be
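Regarding (a) and (b): a minimal sketch (an assumption, not the actual runtestsremote.py code) of how the initial-failure path could print an error line TBPL already recognizes and exit with a status the buildbot step could map to RETRY. The RETRY_EXIT_CODE value and the buildbot-side mapping are assumptions; the real mapping would live in the factory/step configuration.

import socket
import sys

SUT_PORT = 20701          # SUT agent command port on the pandas
RETRY_EXIT_CODE = 4       # assumed value the buildbot step maps to RETRY

def connect_or_request_retry(host, timeout=30):
    try:
        return socket.create_connection((host, SUT_PORT), timeout=timeout)
    except socket.error as e:
        # "Remote Device Error:" is a prefix TBPL already bubbles up (see the
        # reboot step later in this log), so reuse it at the initial failure too.
        print("Remote Device Error: unable to connect socket: %s" % e)
        sys.exit(RETRY_EXIT_CODE)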
In this particular log's case it *looks* like the following happens:

* installApp.py
** Installs the needed apks
** Issues a PDU reboot
** Attempts to wait for the device to come back
*** Ends up being speedy and connects right away (a more patient version of this wait is sketched below)
** Considers itself good/passed
* runtests
** Tries [and fails] to connect to the device
** Considers the job failed
** Attempts to run logcat after the job
** --SUCCEEDS-- at running logcat, indicating the device is truly up at this point
** Moves on because the job was considered failed.
* reboot.py
** Issues a PDU reboot
** Fails to reconnect on the first try
*** Prints out "Remote Device Error" which TBPL recognizes and bubbles
** --SUCCEEDS-- in connecting on the second try
** Exits [successfully, though that doesn't matter for this script/part-of-jobs]
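A rough sketch of what a more patient wait in installApp.py could look like, per the "ends up being speedy and connects right away" note above: after the PDU power cycle, first wait for the SUT agent port to actually go away, then require several consecutive successful probes before declaring the device back. The function names, port and timings are illustrative assumptions, not the real installApp.py code.

import socket
import time

SUT_PORT = 20701  # SUT agent command port (assumed)

def _agent_up(host, timeout=10):
    try:
        socket.create_connection((host, SUT_PORT), timeout=timeout).close()
        return True
    except socket.error:
        return False

def wait_for_reboot(host, down_timeout=120, up_timeout=300, stable_probes=3):
    # Phase 1: the board may still be up for a few seconds after the PDU call,
    # so do not trust an immediate successful connect.
    deadline = time.time() + down_timeout
    while _agent_up(host) and time.time() < deadline:
        time.sleep(5)
    # Phase 2: wait for the agent to come back, and stay back for several
    # probes in a row before reporting success.
    deadline = time.time() + up_timeout
    consecutive = 0
    while time.time() < deadline:
        consecutive = consecutive + 1 if _agent_up(host) else 0
        if consecutive >= stable_probes:
            return True
        time.sleep(5)
    return False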
1) reboot.py should fix the "Remote Device Error" messages.
2) Should installApp.py do a better job of ensuring the device has come back online?
3) Should runtests.py have a better retry mechanism for connecting by default? We recently fixed this for inside the test, but we could probably tweak it a bit for the initial connection (see the sketch below).
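For point 3, a small retry/backoff loop around the initial connection is one option. This is a sketch under assumptions; connect_device() is a hypothetical stand-in for whatever runtests.py actually uses to open the SUT agent connection.

import time

def connect_with_retries(connect_device, attempts=8, delay=5, max_delay=60):
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect_device()
        except Exception as e:  # e.g. socket.error: [Errno 111] Connection refused
            last_error = e
            print("Try %d failed (%s); sleeping for %d seconds" % (attempt, e, delay))
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # capped exponential backoff
    raise last_error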
Depends on: 825460
Callek, the last week or so this has been pretty prominent on OrangeFactor. Are you ok being point on this bug? :-)
Flags: needinfo?(bugspam.Callek)
(In reply to Ed Morley (Away 20th Dec-2nd Jan) [UTC+0; email:edmorley@moco] from comment #15)
> Callek, the last week or so this has been pretty prominent on OrangeFactor.
> Are you ok being point on this bug? :-)

Indeed I am; my dependent bug (once deployed) *should* cut this down to almost non-existent.

At least the "Remote Device Error: unable to connect socket: [Errno 111] Connection refused" part.
Flags: needinfo?(bugspam.Callek)
(In reply to Justin Wood (:Callek) from comment #16)
> Indeed I am; my dependent bug (once deployed) *should* cut this down to
> almost non-existent.
>
> At least the "Remote Device Error: unable to connect socket: [Errno 111]
> Connection refused" part.

Ah great! Hadn't gotten that far through the bugmail yet :-)
(In reply to Justin Wood (:Callek) from comment #16)
> (In reply to Ed Morley (Away 20th Dec-2nd Jan) [UTC+0; email:edmorley@moco] from comment #15)
> > Callek, the last week or so this has been pretty prominent on OrangeFactor.
> > Are you ok being point on this bug? :-)
>
> Indeed I am; my dependent bug (once deployed) *should* cut this down to
> almost non-existent.
>
> At least the "Remote Device Error: unable to connect socket: [Errno 111]
> Connection refused" part.

We're still getting this after the deploy. Don't suppose you can take another look?

Just keen for Android 4.0 to not become the next Android thing that devs learn to ignore :-)
Flags: needinfo?(bugspam.Callek)
Any joy with this Callek? :-)
Depends on: 837356
(In reply to Ed Morley [:edmorley UTC+0] from comment #44)
> We're still getting this after the deploy. Don't suppose you can take
> another look?
>
> Just keen for Android 4.0 to not become the next Android thing that devs
> learn to ignore :-)

Callek? You've been ignoring the needinfo on this for 2 weeks! :P
(In reply to Ed Morley (Away until 13th March) [:edmorley UTC+0] from comment #1105)
> Callek? You've been ignoring the needinfo on this for 2 weeks! :P

Sorry! (totally missed it amid all my SeaMonkey stuff I've been ignoring for longer)

I'm out of good ideas on here for now; perhaps once we start cleaning up the technical debt in this code we'll find an obvious issue.

That said, I know Joel has seen signs of pandas rebooting due to thermal issues, so maybe he has ideas? I note we are still a lot better than we were a month or two ago (but still nowhere near where we want to be).

Punting to Joel in case he has good ideas.
Flags: needinfo?(bugspam.Callek) → needinfo?(jmaher)
This is yet another bug where we reboot in the middle of a step during a test run. We are looking into this, but it could be another month or two. I have been investigating temperature issues (which I am thinking are not an issue), memory, CPU starvation, kernel panics, and relay board reboot issues. In total there are a lot of factors to investigate, and running the tests for these takes a long time.

Some basic stats:
* ~7.5% total failure rate on the pandas
* I estimate 3-4% of that is due to reboots in the middle of a test

We are working on this general problem each day. When we fix a problem causing reboots in the middle of the tests and can prove it lowers the overall failure rate, I would like to revisit all these bugs for panda boards and connection/reboots.
Flags: needinfo?(jmaher)
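Back-of-the-envelope on Joel's numbers above (illustrative only): if the overall panda failure rate is ~7.5% and ~3-4% of runs fail because of mid-test reboots, then roughly half of all panda failures would go away if the reboot problem were fixed.

total_failure_rate = 0.075   # ~7.5% of panda jobs fail
reboot_failure_rate = 0.035  # midpoint of the 3-4% estimate
share = reboot_failure_rate / total_failure_rate
print("~%.0f%% of panda failures attributable to mid-test reboots" % (share * 100))
# -> ~47%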
(In reply to Justin Wood (:Callek) from comment #1324)
> > You've been ignoring the needinfo on this for 2 weeks! :P
>
> Sorry! (totally missed it amid all my SeaMonkey stuff I've been ignoring for
> longer)

Perhaps set up a filter on |X-Bugzilla-Type = request|, so that reviews, feedback and needinfos don't get buried by everything else? :-) (I find this approach works well for me)

(In reply to Joel Maher (:jmaher) from comment #1327)
> This is yet another bug where we reboot in the middle of a step during a
> test run. We are looking into this, but it could be another month or two.
> I have been investigating temperature issues (which I am thinking are not an
> issue), memory, CPU starvation, kernel panics, and relay board reboot
> issues. In total there are a lot of factors to investigate, and running the
> tests for these takes a long time.
>
> Some basic stats:
> * ~7.5% total failure rate on the pandas
> * I estimate 3-4% of that is due to reboots in the middle of a test
>
> We are working on this general problem each day. When we fix a problem
> causing reboots in the middle of the tests and can prove it lowers the
> overall failure rate, I would like to revisit all these bugs for panda
> boards and connection/reboots.

Thank you for the update, Joel :-)
Information from the etherpad regarding Android releng crashes:
* many failures here are the test harness failing during shutdown (nssDestroy, etc...)
* some failures here are a failure to execute a command on the panda, usually followed by an unexpected reboot
* all these failures do a reboot as the last step and never reconnect
* how come we are not trying a second PDU reboot? (sketched below)
* about 875 instances in the last 2 months
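On the "second PDU reboot" question: a sketch of what the final reboot step could do instead of giving up after one power cycle. Both pdu_powercycle() and wait_for_reboot() are hypothetical stand-ins for the actual relay/PDU tooling and reconnect polling in the releng scripts; the point is only the extra attempt before reporting failure.

def reboot_with_retry(device, relay_info, pdu_powercycle, wait_for_reboot,
                      max_powercycles=2):
    for attempt in range(1, max_powercycles + 1):
        print("Calling PDU powercycle for %s (attempt %d)" % (device, attempt))
        pdu_powercycle(relay_info)
        if wait_for_reboot(device):
            return True
    print("Remote Device Error: %s never reconnected after %d powercycles"
          % (device, max_powercycles))
    return False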
Depends on: 870378
(OrangeWFM bugs not modified in > 2 months)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME