Closed Bug 1028772 Opened 10 years ago Closed 10 years ago

Tegra recovery - [chunk A]

Categories

(Infrastructure & Operations :: DCOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

References

Details

Over the last month we (philor) have disabled 64 tegras. These 64 tegras were all running jobs fine, but in waves they suddenly stopped taking jobs.

These waves happened roughly every one to two weeks, and we would lose anywhere from a dozen to a few dozen tegras over the course of a day or two.

I imagine the solution to get them back into prod will be largely the same for all of them, since their watcher.logs show similar output. Hence this bug.

I don't know (1) what happened to make them all fall off the map or (2) how to fix them. They *appear* to have been restarted many times, as the logs show countless power cycles, and, since they were running jobs fine before, I don't think they need a reformat/re-image. This seems to be some sort of connection issue.

Typical log output, from tegras whose logs still recorded the original error before filling up with endless (slaverebooter?) reboot output:
06/17/2014 04:25:05: DEBUG: 30056 ?        Sl     0:05 /builds/tegra-245/test/build/hostutils/bin/ssltunnel /tmp/ssltunnelXDsM26.cfg
06/17/2014 04:25:05: DEBUG: 31519 ?        S      5:20 /usr/bin/python /builds/tegra-246/test/build/tests/mochitest/pywebsocket_wrapper.py -p 9988 -w /builds/tegra-246/test/build/tests/mochitest -l /builds/tegra-246/test/build/tests/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
06/17/2014 04:25:05: DEBUG: 31521 ?        Sl     0:08 /builds/tegra-246/test/build/hostutils/bin/ssltunnel /tmp/ssltunnelSjebF0.cfg
06/17/2014 04:25:05: INFO: No mozpool server in this VLAN
06/17/2014 04:25:05: INFO: Unable to determine state from Mozpool, falling back to device checks
06/17/2014 04:25:05: INFO: INFO: attempting to ping device
06/17/2014 04:25:05: DEBUG: calling [ping -c 5 tegra-234]
06/17/2014 04:25:09: INFO: Connecting to: tegra-234
06/17/2014 04:25:59: INFO: INFO: Unable to connect to device after 1 try
06/17/2014 04:25:59: INFO: We're going to sleep for 90 seconds
06/17/2014 04:27:29: INFO: Connecting to: tegra-234
06/17/2014 04:28:20: INFO: INFO: Unable to connect to device after 2 try
06/17/2014 04:28:20: INFO: We're going to sleep for 90 seconds
06/17/2014 04:29:50: INFO: Connecting to: tegra-234
2014-06-17 04:30:01 -- *** ERROR *** failed to aquire lockfile
06/17/2014 04:30:40: INFO: INFO: Unable to connect to device after 3 try
06/17/2014 04:30:40: INFO: We're going to sleep for 90 seconds
06/17/2014 04:32:10: INFO: Connecting to: tegra-234
06/17/2014 04:33:00: INFO: INFO: Unable to connect to device after 4 try
06/17/2014 04:33:00: INFO: We're going to sleep for 90 seconds
06/17/2014 04:34:30: INFO: Connecting to: tegra-234
2014-06-17 04:35:01 -- *** ERROR *** failed to aquire lockfile
06/17/2014 04:35:21: INFO: /builds/tegra-234/error.flg
06/17/2014 04:35:51: INFO: verifyDevice: failing to telnet
reconnecting socket
Could not connect; sleeping for 5 seconds.
reconnecting socket
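For anyone wanting to reproduce the check by hand, here is a minimal sketch of the same ping / connect / sleep loop the watcher is running above. This is my own script, not the sut_tools code; it assumes the SUT agent's command port is 20701 (the devicemanagerSUT default) and copies the 4-try / 90-second cadence from the log.

# Hand-rolled check that mirrors what the watcher log above is doing:
# ping the tegra, then try to open a TCP connection to the SUT agent.
# Assumption (not taken from this bug): the agent listens on port 20701.
import socket
import subprocess
import time

def can_ping(host):
    # 5 echo requests, same as the "ping -c 5" call in the log
    return subprocess.call(["ping", "-c", "5", host]) == 0

def can_connect(host, port=20701, timeout=30):
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

def check_device(host, tries=4, sleep=90):
    if not can_ping(host):
        print("%s: not pingable" % host)
        return False
    for attempt in range(1, tries + 1):
        print("Connecting to: %s" % host)
        if can_connect(host):
            return True
        print("Unable to connect to device after %d try" % attempt)
        time.sleep(sleep)
    return False

if __name__ == "__main__":
    check_device("tegra-234")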

This is in line with tegras I found the last time I was on buildduty: https://bugzilla.mozilla.org/show_bug.cgi?id=740440#c18

This looks to match up with pete's findings: https://bugzilla.mozilla.org/show_bug.cgi?id=1018118#c4

So if this bug is the solution: "Bug 1010173 - test root internal variable on devices (SUTAgentAndroid.sTestRoot) should not be set as an error message", \o/ - but that probably needs to be looked at again soon. Maybe there is a 'quick fix' we can apply here. It would be nice to get 64 healthy tegras back into prod all at once.

Callek, pete - do you know anything we can do manually to get these back? It looks like many of them are complaining about failing to telnet: http://mxr.mozilla.org/build/source/tools/sut_tools/verify.py#373. I'm not sure if a slow_reboot will help like we did for canPing() as, like I mentioned above, they *seem* to be restarting often.
needinfo WRT comment 0 ^
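(For reference, a minimal sketch of what I mean by poking the agent directly, to tell "nothing is listening" apart from "the agent is wedged". This is my own script, not verify.py; it assumes the agent listens on port 20701 and answers a "ver" command, so adjust if the agent config differs.)

# Send "ver" to the SUT agent and dump whatever comes back.
# Assumptions (not from this bug): command port 20701, "ver" command.
import socket
import sys

def sut_ver(host, port=20701, timeout=10):
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.settimeout(timeout)
    try:
        sock.sendall(b"ver\n")
        data = b""
        try:
            while True:
                chunk = sock.recv(1024)
                if not chunk:
                    break
                data += chunk
        except socket.timeout:
            pass  # stop reading once the agent goes quiet
        return data.decode("utf-8", "replace")
    finally:
        sock.close()

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "tegra-234"
    try:
        print(sut_ver(host))
    except (socket.error, socket.timeout) as e:
        print("%s: connection failed (%s)" % (host, e))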
Flags: needinfo?(pmoore)
Flags: needinfo?(bugspam.Callek)
(In reply to Jordan Lund (:jlund) from comment #1)
> needinfo WRT comment 0 ^

So I'm lobbing this chunk/bug over the fence at DCOps for reimage of the affected tegras.

I'm not really sure what's going on here. I can't telnet into many of the devices, and for the few disabled ones that I can reach, I can't figure out what's wrong (things "look" ok).

I can't dive into the devices that are down with adb (we don't run adb by default) either :(

All of the affected ones I can happily reboot via PDU, so that doesn't sound like the issue.
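(Not the releng tooling, just for illustration of the PDU reboots mentioned above: one common way to power-cycle a single outlet on an APC-style PDU is SNMP against the sPDUOutletCtl OID, where value 3 means reboot. The PDU host, community string, and outlet number below are placeholders; the actual PDUs and tooling in the datacenter may differ.)

# Hedged sketch: reboot one PDU outlet via snmpset (APC sPDUOutletCtl).
# Placeholder PDU host / community / outlet; real values come from inventory.
import subprocess

SPDU_OUTLET_CTL = ".1.3.6.1.4.1.318.1.1.4.4.2.1.3"

def pdu_reboot(pdu_host, outlet, community="private"):
    # snmpset -v1 -c <community> <pdu> <oid>.<outlet> i 3   (3 == outletReboot)
    oid = "%s.%d" % (SPDU_OUTLET_CTL, outlet)
    return subprocess.call(
        ["snmpset", "-v1", "-c", community, pdu_host, oid, "i", "3"]) == 0

if __name__ == "__main__":
    pdu_reboot("pdu-example.example.com", 1)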


============
for DCOps:
============

Once this batch is done, I suspect we'll have another 60 or so for you within a week, fwiw.

Priority-wise these are less important than the move-train stuff for pandas, and less important than windows test machine recovery (the t-{w732,w864,xp32}-* hosts), but more important than any other "Unreachable" bug.
Assignee: nobody → server-ops-dcops
Blocks: tegra-146
Component: Buildduty → Server Operations: DCOps
Flags: needinfo?(pmoore)
Flags: needinfo?(bugspam.Callek)
Product: Release Engineering → mozilla.org
QA Contact: bugspam.Callek → dmoore
Summary: Tegra recovery - multiple tegras stopped taking jobs over the last month → Tegra recovery - [chunk A]
Version: unspecified → other
tegras in chunk A done.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 1031502
Product: mozilla.org → Infrastructure & Operations