Closed Bug 778922 (tegra-182) Opened 12 years ago Closed 11 years ago

tegra-182 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)


Details

(Whiteboard: [buildduty][buildslaves][capacity][badslave])

Last Job 4 days, 19:59:36 ago
Tried a PDU reboot, but the device never lost ping connectivity.

Can't connect remotely.

bash-3.2$ telnet tegra-182 20701
Trying 10.250.50.92...
telnet: connect to address 10.250.50.92: Connection refused
telnet: Unable to connect to remote host
bash-3.2$
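The state above (ping responds, but the SUTAgent port refuses connections) can be probed in one step. This is a hedged sketch, not a RelEng tool: port 20701 is taken from the transcript above, and `port_open`/`check_device` are hypothetical helpers relying on bash's `/dev/tcp` redirection.

```shell
#!/bin/bash
# Sketch only: probe a device the way the transcript above does.
# port_open and check_device are hypothetical helpers, not RelEng tooling.

port_open() {
  # Attempt a TCP connect via bash's /dev/tcp; succeeds only if
  # something is listening on host $1, port $2.
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

check_device() {
  host="$1"; port="$2"
  # Ping first: a Tegra can answer ping while SUTAgent is down,
  # which is exactly the symptom described above.
  ping -c 1 -W 2 "$host" > /dev/null 2>&1 \
    && echo "$host: ping ok" \
    || echo "$host: no ping"
  port_open "$host" "$port" \
    && echo "$host: port $port open" \
    || echo "$host: port $port refused"
}

# Usage against the device in this bug:
#   check_device tegra-182 20701
```

"ping ok" plus "port 20701 refused" from a sketch like this would match the telnet output above: the network stack is up but the SUTAgent is not accepting connections.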

Please ensure that (a) this is the correct PDU info:

"pdu": "pdu5.df202-1.build.mtv1.mozilla.com",
"pduid": ".AA10"

(b) this tegra has no battery attached and its switches are properly set.

Lastly, please reimage once more.
Depends on: 780798
Hi Justin,

The pdu info was wrong. It should have been AA2. Question though, should it be listed in inventory as AA2 or A2? It was previously listed as just A10 in inventory so I changed it to A2.

Thanks,
Van
I think this tegra is also having issues à la bug 782495 -- I'm running ./stop_cp.sh on its foopy at the moment, pending A-Team direction on that bug.
Blocks: 782495
This machine is in the production pool, though I'm not sure it's supposed to be.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Well, it's done 101 jobs since it snuck back in, of which only 26 were failures, but 12 of the last 25 makes me think it's having issues a la "this tegra is busted crap that makes Android tests look like something that should be ignored."
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Make that 15 of the last 25, and where I'm seeing it frequently is in bug 689856, which I think probably is just the stealthy way of doing bug 782495 without leaving evidence.

Rather than just taking it out for a few days by running stop_cp and then letting it come back, can we actually disable it? Do we even have a way of disabling broken tegras? We seem to be doing this in bug after bug after bug, "this tegra is broken" "ran stop_cp" "this tegra is back in production" "this tegra is broken and failing all the time" "ran stop_cp" "this tegra is back in production" "this tegra is broken and failing all the time" "ran stop_cp"
Summary: tegra-182 problem tracking → [disable me] tegra-182 problem tracking
Comment #6 clearly needs addressing, but in order to staunch the flakiness, stop_cp has been run.
Summary: [disable me] tegra-182 problem tracking → tegra-182 problem tracking
ATeam thinks this is a device issue; let's recover it.
Depends on: 792316
IT handled this; start_cp has been run.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whatever its current state is, it isn't FIXED.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Had to PDU reboot, but the PDU settings were wrong (comment #2); just fixed them, and the device is up and looks to be functioning properly.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
SUTAgent not present; time for recovery.
Status: RESOLVED → REOPENED
Depends on: 807965
Resolution: FIXED → ---
Do we have two images that we use for tegra recovery, "the good image" and "the borken image"? This slave appears to have been in pretty good shape, continuing to do jobs both before and after comment 17, but since the probable time of that recovery (guessing based on a long enough gap between runs) it has turned simply awful: 18% green.
Depends on: 808437
Ran ./stop_cp.sh
Blocks: 808468
Blocks: 813012
No longer blocks: 813012
Should be back up.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Those duplicates would be the four times that someone wasn't as lucky as I was in comment 7, so they went to all the trouble of filing a bug when the summary could have been "tegra-182 is broken" and the description could have been "tegra-182 is broken."
Severity: normal → major
Status: RESOLVED → REOPENED
Priority: P3 → --
Resolution: FIXED → ---
Whiteboard: [buildduty][buildslaves][capacity] → [buildduty][buildslaves][capacity][badslave]
Running stop_cp based on comment #25.
(mass change: filter on tegraCallek02reboot2013)

I just rebooted this device, hoping that many of the ones I'm doing tonight come back automatically. I'll check back tomorrow to see if it did; if it does not, I'll triage the next step manually on a per-device basis.

---
Command I used (with a manual patch to the fabric script to allow this command)

(fabric)[jwood@dev-master01 fabric]$  python manage_foopies.py -j15 -f devices.json `for i in 021 032 036 039 046  048 061 064 066 067 071 074 079 081 082 083 084 088 093 104 106 108 115 116 118 129 152 154 164 168 169 174 179 182 184 187 189 200 207 217 223 228 234 248 255 264 270 277 285 290 294 295 297 298 300 302 304 305 306 307 308 309 310 311 312 314 315 316 319 320 321 322 323 324 325 326 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 345 346 347 348 349 350 354 355 356 358 359 360 361 362 363 364 365 367 368 369; do echo '-D' tegra-$i; done` reboot_tegra

The command does the reboots one at a time from the foopy each device is connected to, with one ssh connection per foopy.
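The backtick section of that command is just a command substitution that expands a whitespace-separated device list into repeated `-D tegra-NNN` flags. A minimal sketch with a shortened, hypothetical device list:

```shell
# Sketch of the inline loop above, with a truncated device list.
devices="021 032 182"

# Each iteration emits "-D tegra-NNN" on its own line; command
# substitution plus word splitting flattens the lines into a single
# argument list.
args=$(for i in $devices; do echo '-D' tegra-$i; done)

echo $args   # unquoted on purpose, to show the flattened form
# -D tegra-021 -D tegra-032 -D tegra-182
```

In the real invocation those expanded flags are passed straight to manage_foopies.py, which then runs reboot_tegra for each listed device.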
Depends on: 838687
No longer blocks: 808468
Depends on: 808468
Back from recovery
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard