Closed Bug 1028772 Opened 11 years ago Closed 11 years ago

Tegra recovery - [chunk A]

Categories

(Infrastructure & Operations :: DCOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

References

Details

Over the last month we (philor) have disabled 64 tegras. These tegras were all running jobs pretty well, but in waves they suddenly failed to actually take jobs. The waves happened roughly weekly or bi-weekly, and we would lose anywhere from a dozen to a few dozen over the course of a day or two. I would imagine the solution to get them back in prod will be somewhat universal, as the watcher.log files show similar output for all of them; hence this bug. I don't know (1) what happened to make them all fall off the map, or (2) how to fix them. They *appear* to have been restarted many times, as the logs show countless powercycles, and, since they were running jobs fine, I don't think they need a reformat/re-image. This seems to be some sort of connection issue.

The general log output, for tegras that still have a log recording of the error before they filled up with endless (slaverebooter?) reboot output:

06/17/2014 04:25:05: DEBUG: 30056 ? Sl 0:05 /builds/tegra-245/test/build/hostutils/bin/ssltunnel /tmp/ssltunnelXDsM26.cfg
06/17/2014 04:25:05: DEBUG: 31519 ? S 5:20 /usr/bin/python /builds/tegra-246/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/tegra-246/test/build/tests/mochitest -l /builds/tegra-246/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
06/17/2014 04:25:05: DEBUG: 31521 ? Sl 0:08 /builds/tegra-246/test/build/hostutils/bin/ssltunnel /tmp/ssltunnelSjebF0.cfg
06/17/2014 04:25:05: INFO: No mozpool server in this VLAN
06/17/2014 04:25:05: INFO: Unable to determine state from Mozpool, falling back to device checks
06/17/2014 04:25:05: INFO: INFO: attempting to ping device
06/17/2014 04:25:05: DEBUG: calling [ping -c 5 tegra-234]
06/17/2014 04:25:09: INFO: Connecting to: tegra-234
06/17/2014 04:25:59: INFO: INFO: Unable to connect to device after 1 try
06/17/2014 04:25:59: INFO: We're going to sleep for 90 seconds
06/17/2014 04:27:29: INFO: Connecting to: tegra-234
06/17/2014 04:28:20: INFO: INFO: Unable to connect to device after 2 try
06/17/2014 04:28:20: INFO: We're going to sleep for 90 seconds
06/17/2014 04:29:50: INFO: Connecting to: tegra-234
2014-06-17 04:30:01 -- *** ERROR *** failed to aquire lockfile
06/17/2014 04:30:40: INFO: INFO: Unable to connect to device after 3 try
06/17/2014 04:30:40: INFO: We're going to sleep for 90 seconds
06/17/2014 04:32:10: INFO: Connecting to: tegra-234
06/17/2014 04:33:00: INFO: INFO: Unable to connect to device after 4 try
06/17/2014 04:33:00: INFO: We're going to sleep for 90 seconds
06/17/2014 04:34:30: INFO: Connecting to: tegra-234
2014-06-17 04:35:01 -- *** ERROR *** failed to aquire lockfile
06/17/2014 04:35:21: INFO: /builds/tegra-234/error.flg
06/17/2014 04:35:51: INFO: verifyDevice: failing to telnet
reconnecting socket
Could not connect; sleeping for 5 seconds.
reconnecting socket

This is in line with tegras I found last time on buildduty: https://bugzilla.mozilla.org/show_bug.cgi?id=740440#c18

This also looks to match up with pete's findings: https://bugzilla.mozilla.org/show_bug.cgi?id=1018118#c4

So if this bug is the solution: "Bug 1010173 - test root internal variable on devices (SUTAgentAndroid.sTestRoot) should not be set as an error message", \o/, but that probably needs to be looked at again soon. Maybe there is a 'quick fix' we can apply here. It would be nice to get 64 healthy tegras back in prod all at once.
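For anyone reading along, here is a minimal sketch (not the actual sut_tools code) of the check sequence the watcher.log output above appears to follow: ping the device, retry a socket connection to the SUT agent port with 90-second sleeps, and drop an error.flg file if the device never answers. The port number (20701), the retry count, and the helper names are my assumptions, not taken from verify.py.

# sketch only: mirrors the retry loop seen in the log, not the real verify code
import socket
import subprocess
import time

SUT_PORT = 20701          # assumed SUT agent command port
MAX_ATTEMPTS = 5          # assumed; the log shows at least 4 attempts
SLEEP_BETWEEN = 90        # matches "We're going to sleep for 90 seconds"

def can_ping(host):
    # Same ping the log shows: "calling [ping -c 5 tegra-234]"
    return subprocess.call(["ping", "-c", "5", host],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

def can_connect(host, port=SUT_PORT, timeout=60):
    # One TCP connection attempt to the agent port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verify_device(host, error_flag_path):
    if not can_ping(host):
        print("device does not answer ping")
    for attempt in range(1, MAX_ATTEMPTS + 1):
        print("Connecting to: %s" % host)
        if can_connect(host):
            return True
        print("Unable to connect to device after %d try" % attempt)
        if attempt < MAX_ATTEMPTS:
            print("We're going to sleep for %d seconds" % SLEEP_BETWEEN)
            time.sleep(SLEEP_BETWEEN)
    # Mirrors the error.flg line at 04:35:21 in the log above
    with open(error_flag_path, "w") as flag:
        flag.write("verifyDevice: failing to telnet\n")
    return False

if __name__ == "__main__":
    verify_device("tegra-234", "/builds/tegra-234/error.flg")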
Callek, pete - do you know anything we can do manually to get these back? It looks like many of them are complaining about failing to telnet: http://mxr.mozilla.org/build/source/tools/sut_tools/verify.py#373. I'm not sure whether a slow_reboot will help here like it did for canPing(), since, as I mentioned above, they *seem* to be restarting often.
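In case it helps with manual triage, a quick spot-check sketch: open a raw TCP connection to what I believe is the SUT agent command port (20701, assumed) on one of the disabled tegras and print whatever banner comes back. A refused or timed-out connection would line up with the "verifyDevice: failing to telnet" output above; this makes no assumptions about the agent protocol itself.

# spot-check sketch, port 20701 assumed
import socket
import sys

host = sys.argv[1] if len(sys.argv) > 1 else "tegra-234"
try:
    with socket.create_connection((host, 20701), timeout=30) as s:
        s.settimeout(5)
        try:
            banner = s.recv(1024)
            print("connected; agent said: %r" % banner)
        except socket.timeout:
            print("connected, but no banner within 5 seconds")
except OSError as e:
    print("could not connect to %s:20701 (%s)" % (host, e))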
needinfo WRT comment 0 ^
Flags: needinfo?(pmoore)
Flags: needinfo?(bugspam.Callek)
(In reply to Jordan Lund (:jlund) from comment #1)
> needinfo WRT comment 0 ^

So I'm lobbing this chunk/bug over the fence at DCOps for a reimage of the affected tegras.

I'm not really sure what's going on here. I can't telnet into many of the devices, and for the few disabled ones that I can reach, I can't figure out what's wrong (things "look" ok). I can't dive into the down devices with adb either (we don't run adb by default) :( All of the affected ones I can happily reboot via PDU, so that doesn't sound like the issue.

============
for DCOps:
============

Once this batch is done, I suspect we'll have another 60 or so for you within a week, fwiw. Priority-wise these are less important than the move-train stuff for pandas, and less important than windows test machine recovery (the t-{w732,w864,xp32}-* hosts), but more important than any other "Unreachable" bug.
Assignee: nobody → server-ops-dcops
Blocks: tegra-146
Component: Buildduty → Server Operations: DCOps
Flags: needinfo?(pmoore)
Flags: needinfo?(bugspam.Callek)
Product: Release Engineering → mozilla.org
QA Contact: bugspam.Callek → dmoore
Summary: Tegra recovery - multiple tegras stopped taking jobs over the last month → Tegra recovery - [chunk A]
Version: unspecified → other
Tegras in chunk A are done.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 1031502
Product: mozilla.org → Infrastructure & Operations