Bug 785681 (tegra-349)

tegra-349 problem tracking

Status

task
RESOLVED FIXED
7 years ago
Last year

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [buildduty][buildslave][capacity][mobile])

Reporter

Description

7 years ago
Lifetime, it's over 50% green, but https://secure.pub.build.mozilla.org/buildapi/recent/tegra-349 says it's under 50% for the last 25, including several entries in the death-knell bug 782495. The boy just ain't right.
running stop_cp now
Whiteboard: [buildduty][buildslave][capacity][mobile]
Back in production.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter

Comment 3

7 years ago
https://secure.pub.build.mozilla.org/buildapi/recent/tegra-349

So far today, it has run 13 jobs, of which 2 were green.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: tegra-349 problem tracking → [disable me] tegra-349 problem tracking
Reporter

Comment 4

7 years ago
The good news is that it has taken to failing in auto-RETRY ways. The bad news is that things like https://tbpl.mozilla.org/php/getParsedLog.php?id=15500029&tree=Mozilla-Inbound with an "Automation error: Error receiving data from socket (possible reboot)" and then a "devicemanager.DMError: unable to connect to 10.250.51.189 after 5 attempts" in the reboot step that triggers the RETRY are not encouraging.
Reporter

Comment 5

7 years ago
Clinging above 50% success, barely, at 54:46. By way of comparison, an unbroken tegra will be between 80 and 100% success these days.
Reporter

Comment 6

7 years ago
No surprise that it lost its grip, down to 44% success over the last 100 jobs, 36% over the last 50.
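For anyone wanting to reproduce these windowed figures, here is a minimal sketch of the arithmetic. It assumes the job outcomes have already been pulled out of buildapi into a list of booleans, oldest first; the function name and the sample history are hypothetical and made up for illustration, not real tegra-349 data.

```python
def green_rate(results, window=None):
    """Return the fraction of green (successful) jobs.

    results: job outcomes ordered oldest to newest, True meaning green.
    window: if given, only the most recent `window` jobs are counted.
    """
    if window is not None and window <= 0:
        raise ValueError("window must be a positive number of jobs")
    recent = results if window is None else results[-window:]
    if not recent:
        raise ValueError("no jobs to evaluate")
    return sum(recent) / len(recent)


# Hypothetical 100-job history, invented only to exercise the function.
history = [True] * 26 + [False] * 24 + [True] * 18 + [False] * 32
print(f"last 100: {green_rate(history, 100):.0%}")
print(f"last 50:  {green_rate(history, 50):.0%}")
```

A shrinking window with a falling rate, as in this comment, means the most recent jobs are worse than the older ones.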
Reporter

Comment 7

7 years ago
40% since I reopened the bug, 20% for the last 25. I can't imagine why it might be that whenever anything goes wrong with an Android test run, I blame the slave first.
Reporter

Comment 8

7 years ago
Should I just keep a mental list of the slaves which are so completely horribly broken that every result from them should be completely ignored?

And if so, how should I apply that list to people's try runs, which I only very rarely look at?
Severity: normal → blocker
Priority: -- → P1
Reporter

Updated

7 years ago
Severity: blocker → normal
Priority: P1 → --

Updated

7 years ago
Summary: [disable me] tegra-349 problem tracking → tegra-349 problem tracking
Back to "normal" since the tegra is disabled.
Severity: blocker → normal
Reporter

Comment 14

7 years ago
Including the run it started one second before your comment, it's done five more runs, which doesn't seem very disabled.
ran stop_cp
Back in production.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Reporter

Comment 20

7 years ago
Looks like it took itself out after 5 runs, though personally I would have called it clearly still broken, defective, non-merchantable and utterly lacking in fitness after the first 3 of those.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ran stop_cp. Dunno what to do with it. Callek?
Assignee: nobody → bugspam.Callek
Reporter

Updated

7 years ago
Severity: blocker → normal
Reporter

Updated

7 years ago
Depends on: 808437
Ran ./stop_cp.sh
Blocks: 808468
Blocks: 813012
No longer blocks: 813012
No connection on SUT port after multiple power cycles.
Depends on: 822038
Assignee: bugspam.Callek → nobody

Comment 24

7 years ago
Tegra has been reimaged. No SD card swap.
pdu reboot
start_cp
Back in production.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Still faulty, please can we stop this tegra again :-)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ran stop_cp, back to recovery for ya!
Depends on: 825335
(mass change: filter on tegraCallek02reboot2013)

I just rebooted this device, hoping that many of the ones I'm doing tonight come back automatically. I'll check back tomorrow to see whether it did; if it did not, I'll triage the next step manually on a per-device basis.

---
Command I used (with a manual patch to the fabric script to allow this command):

(fabric)[jwood@dev-master01 fabric]$  python manage_foopies.py -j15 -f devices.json `for i in 021 032 036 039 046  048 061 064 066 067 071 074 079 081 082 083 084 088 093 104 106 108 115 116 118 129 152 154 164 168 169 174 179 182 184 187 189 200 207 217 223 228 234 248 255 264 270 277 285 290 294 295 297 298 300 302 304 305 306 307 308 309 310 311 312 314 315 316 319 320 321 322 323 324 325 326 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 345 346 347 348 349 350 354 355 356 358 359 360 361 362 363 364 365 367 368 369; do echo '-D' tegra-$i; done` reboot_tegra

The command reboots the devices one at a time, from the foopy each device is connected to, with one ssh connection per foopy.
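The backtick loop in that command only expands a list of device numbers into repeated `-D tegra-NNN` arguments. A minimal Python sketch of the same expansion follows; the helper name is hypothetical, only a short subset of the device numbers is shown, and the sketch prints the command line rather than running manage_foopies.py.

```python
def device_flags(numbers, prefix="tegra-"):
    """Expand device numbers into a flat ['-D', 'tegra-NNN', ...] list."""
    flags = []
    for number in numbers:
        flags.extend(["-D", f"{prefix}{number}"])
    return flags


# A short subset of the device numbers from the command above.
# Kept as strings so the leading zeros survive.
numbers = ["021", "032", "349"]

argv = ["python", "manage_foopies.py", "-j15", "-f", "devices.json"]
argv += device_flags(numbers)
argv.append("reboot_tegra")

# Print only; nothing is executed here.
print(" ".join(argv))
```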
Depends on: 838687
now taking jobs
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
No longer blocks: 808468
Depends on: 808468
Reporter

Comment 31

6 years ago
Can we please decommission this tegra? Whatever the street value of a used tegra which hasn't worked for 8 months is, I'll just send you that much money to throw it in a dumpster.

A non-defective tegra will run between 80 and 100% (yes, not kidding, 100%) green. If 349 really strained and applied itself, it could make it up to 30% green. Except that nagios says it's down, the only acceptable state for it other than in-dumpster.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter

Comment 32

6 years ago
Since the 7th, 34 jobs, 6 green, and since it spends most of the day down but flaps a bit, probably ten times as many nagios alerts as green jobs.
Reporter

Comment 33

6 years ago
Since the 15th, 36 jobs, 4 green. Over the last 4 hours, 4 nagios alerts, zero jobs.
Status: REOPENED → NEW
Depends on: 865749
Per c#31, 32 and 33 I'm convinced this deserves a decomm.

Explicitly disabled this device

[foopy30.build.mozilla.org] run: touch /builds/tegra-349/disabled.flg
[OK]   tegra-349 is disabled
No longer depends on: 865749
Depends on: 877781
Reporter

Comment 35

6 years ago
Despite it being decommed, nagios continues to be upset about it being DOWN.
Alerts disabled in nagios.
Needs to be removed from buildbot-configs and [tools] devices.json
(In reply to John Hopkins (:jhopkins) from comment #37)
> Needs to be removed from buildbot-configs and [tools] devices.json

It wasn't in devices.json. I removed it from buildbot-configs here: https://hg.mozilla.org/build/buildbot-configs/rev/7b9578c9a135

Looks like it's not in nagios either.

DEATH TO THE BAD TEGRAS
Status: NEW → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
I also removed tegra-349 from foopy30
(on foopy30: rm -rf /builds/tegra-349)

Currently we do not have an automatic mechanism to keep the foopy device directories in sync with the devices.json file.
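Until such a mechanism exists, a check along the following lines could at least report the drift. This is a rough sketch, not an existing tool: the function name is hypothetical, and it assumes devices.json is a JSON object keyed by device name and that device directories sit directly under the builds directory (as /builds/tegra-349 did on foopy30). It only lists candidates and deletes nothing.

```python
import json
from pathlib import Path


def stale_device_dirs(builds_dir, devices_json, prefix="tegra-"):
    """List device directories under builds_dir that devices.json does not name.

    Assumes devices.json holds a JSON object keyed by device name; the
    real file layout may differ.
    """
    with open(devices_json) as handle:
        known = set(json.load(handle))
    return sorted(
        entry.name
        for entry in Path(builds_dir).iterdir()
        if entry.is_dir()
        and entry.name.startswith(prefix)
        and entry.name not in known
    )
```

Called as `stale_device_dirs("/builds", "devices.json")` on a foopy, it would return the names of leftover directories for someone to review before removing them by hand.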

Updated

Last year
Product: Release Engineering → Infrastructure & Operations