Closed Bug 778922 (tegra-182) Opened 12 years ago Closed 11 years ago

tegra-182 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)


Details

(Whiteboard: [buildduty][buildslaves][capacity][badslave])

Last Job 4 days, 19:59:36 ago
Tried a PDU reboot, but the device never lost ping connectivity.

Can't connect remotely.

bash-3.2$ telnet tegra-182 20701
Trying 10.250.50.92...
telnet: connect to address 10.250.50.92: Connection refused
telnet: Unable to connect to remote host
bash-3.2$
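The state above (ping responds, but the SUTAgent port refuses connections) can be probed in one step. This is a hedged sketch, not a RelEng tool: port 20701 is taken from the transcript above, and `port_open`/`check_device` are hypothetical helpers relying on bash's `/dev/tcp` redirection.

```shell
#!/bin/bash
# Sketch only: probe a device the way the transcript above does.
# port_open and check_device are hypothetical helpers, not RelEng tooling.

port_open() {
  # Attempt a TCP connect via bash's /dev/tcp; succeeds only if
  # something is listening on host $1, port $2.
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

check_device() {
  host="$1"; port="$2"
  # Ping first: a Tegra can answer ping while SUTAgent is down,
  # which is exactly the symptom described above.
  ping -c 1 -W 2 "$host" > /dev/null 2>&1 \
    && echo "$host: ping ok" \
    || echo "$host: no ping"
  port_open "$host" "$port" \
    && echo "$host: port $port open" \
    || echo "$host: port $port refused"
}

# Usage against the device in this bug:
#   check_device tegra-182 20701
```

"ping ok" plus "port 20701 refused" from a sketch like this would match the telnet output above: the network stack is up but the SUTAgent is not accepting connections.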

Please ensure that (a) this is the correct PDU info:

"pdu": "pdu5.df202-1.build.mtv1.mozilla.com",
"pduid": ".AA10"

(b) this tegra has no battery attached and its switches are properly set.

Lastly, please reimage once more.
Depends on: 780798
Hi Justin,

The pdu info was wrong. It should have been AA2. Question though, should it be listed in inventory as AA2 or A2? It was previously listed as just A10 in inventory so I changed it to A2.

Thanks,
Van
I think this tegra is also having issues à la bug 782495 -- I'm running ./stop_cp.sh on its foopy at the moment, pending A-Team direction on that bug.
Blocks: 782495
This machine is in the production pool, though I'm not sure it's supposed to be.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Well, it's done 101 jobs since it snuck back in, of which only 26 were failures, but 12 of the last 25 makes me think it's having issues a la "this tegra is busted crap that makes Android tests look like something that should be ignored."
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Make that 15 of the last 25, and where I'm seeing it frequently is in bug 689856, which I think probably is just the stealthy way of doing bug 782495 without leaving evidence.

Rather than just taking it out for a few days by running stop_cp and then letting it come back, can we actually disable it? Do we even have a way of disabling broken tegras? We seem to be doing this in bug after bug after bug, "this tegra is broken" "ran stop_cp" "this tegra is back in production" "this tegra is broken and failing all the time" "ran stop_cp" "this tegra is back in production" "this tegra is broken and failing all the time" "ran stop_cp"
Summary: tegra-182 problem tracking → [disable me] tegra-182 problem tracking
Comment #6 clearly needs addressing, but in order to staunch the flakiness, stop_cp has been run.
Summary: [disable me] tegra-182 problem tracking → tegra-182 problem tracking
ATeam thinks this is a device issue; let's recover it.
Depends on: 792316
IT handled this; start_cp has been run.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whatever its current state is, it isn't FIXED.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Had to PDU reboot, but the PDU settings were wrong (comment #2); just fixed them, and the device is up and looks to be functioning properly.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
SUTAgent not present; time for recovery.
Status: RESOLVED → REOPENED
Depends on: 807965
Resolution: FIXED → ---
Do we have two images that we use for tegra recovery, "the good image" and "the borken image"? This slave appears to have been in pretty good shape, continuing to do jobs both before and after comment 17, but since the probable time of that recovery (guessing based on a long enough gap between runs) it has turned simply awful: 18% green.
Depends on: 808437
Ran ./stop_cp.sh
Blocks: 808468
Blocks: 813012
No longer blocks: 813012
Should be back up.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Those duplicates would be the four times that someone wasn't as lucky as I was in comment 7, so they went to all the trouble of filing a bug when the summary could have been "tegra-182 is broken" and the description could have been "tegra-182 is broken."
Severity: normal → major
Status: RESOLVED → REOPENED
Priority: P3 → --
Resolution: FIXED → ---
Whiteboard: [buildduty][buildslaves][capacity] → [buildduty][buildslaves][capacity][badslave]
Running stop_cp based on comment #25.
(mass change: filter on tegraCallek02reboot2013)

I just rebooted this device, hoping that many of the ones I'm doing tonight come back automatically. I'll check back tomorrow to see if it did; if it does not, I'll triage the next step manually on a per-device basis.

---
Command I used (with a manual patch to the fabric script to allow this command)

(fabric)[jwood@dev-master01 fabric]$  python manage_foopies.py -j15 -f devices.json `for i in 021 032 036 039 046  048 061 064 066 067 071 074 079 081 082 083 084 088 093 104 106 108 115 116 118 129 152 154 164 168 169 174 179 182 184 187 189 200 207 217 223 228 234 248 255 264 270 277 285 290 294 295 297 298 300 302 304 305 306 307 308 309 310 311 312 314 315 316 319 320 321 322 323 324 325 326 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 345 346 347 348 349 350 354 355 356 358 359 360 361 362 363 364 365 367 368 369; do echo '-D' tegra-$i; done` reboot_tegra

The command does the reboots one at a time from the foopy each device is connected to, with one ssh connection per foopy.
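The backtick section of that command is just a command substitution that expands a whitespace-separated device list into repeated `-D tegra-NNN` flags. A minimal sketch with a shortened, hypothetical device list:

```shell
# Sketch of the inline loop above, with a truncated device list.
devices="021 032 182"

# Each iteration emits "-D tegra-NNN" on its own line; command
# substitution plus word splitting flattens the lines into a single
# argument list.
args=$(for i in $devices; do echo '-D' tegra-$i; done)

echo $args   # unquoted on purpose, to show the flattened form
# -D tegra-021 -D tegra-032 -D tegra-182
```

In the real invocation those expanded flags are passed straight to manage_foopies.py, which then runs reboot_tegra for each listed device.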
Depends on: 838687
No longer blocks: 808468
Depends on: 808468
Back from recovery
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard