Closed Bug 746071 (tegra-048) Opened 12 years ago Closed 10 years ago

tegra-048 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: ARM
OS: Android
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References


Details

(Whiteboard: [buildduty][capacity][buildslaves])

I didn't look further back, but the last 200 jobs it's taken have all wound up in RETRY and tears - it needs to sit in the corner and collect itself.
Ran stop_cp on it.
restarted
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 778812
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Went offline. Trying a PDU reboot.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Back online.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Last Job 20 days, 19:39:58 ago

error.flg [Remote Device Error: Unable to properly remove /mnt/sdcard/tests]

remotely reformatted sdcard
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Green Jobs again
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Please swap the SD card.

error.flg [Remote Device Error: unable to write to sdcard]
Status: RESOLVED → REOPENED
Depends on: 802655
Resolution: FIXED → ---
SD card swapped.
36% green over the last 300 runs.
Blocks: 438871
Whiteboard: [buildduty][capacity][buildslaves] → [buildduty][capacity][buildslaves][orange]
Has been running jobs for awhile now.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Yes, it has been running jobs for a while now. Please, please make it stop running jobs.

Dunno if we have a better report, but according to https://secure.pub.build.mozilla.org/buildapi/reports/slaves (once you "Show all entries" and sort by Slave Name to put the tegras together, because none of the other controls work), a non-broken tegra will run at better than 70% green, up to 100% green. This one has been under 50% for its last 50 jobs, under 50% for its last 100 jobs, under 50% for its last 500 jobs.
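(For anyone repeating this exercise later: a minimal sketch of the green-rate arithmetic, under the assumption that the report can be exported as a newest-first list of per-job result strings with "success" meaning green. The data structure here is hypothetical, not the report's real schema.)

# Hypothetical sketch: compute the green rate over a slave's most recent jobs,
# the way the last-50 / last-100 / last-500 numbers above were eyeballed from
# the buildapi slaves report. The `jobs` list is placeholder data.
def green_rate(jobs, window):
    recent = jobs[:window]              # jobs assumed newest-first
    if not recent:
        return 0.0
    green = sum(1 for result in recent if result == "success")
    return 100.0 * green / len(recent)

jobs = ["retry", "success", "testfailed", "retry"]  # placeholder data
for window in (50, 100, 500):
    print("last %d jobs: %.0f%% green" % (window, green_rate(jobs, window)))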

Randomly choosing from its failures that are currently under my nose: https://tbpl.mozilla.org/php/getParsedLog.php?id=16611152&tree=Mozilla-Inbound shows it barely managing to stay awake long enough to get past verify.py, then falling asleep at the start of the test run. https://tbpl.mozilla.org/php/getParsedLog.php?id=16607708&tree=Mozilla-Inbound was a fairly normal failure; I didn't even notice it was 048, except that it's a "failed to initialize browser" of the sort where... it managed to stay awake through verify.py, and then fell asleep before the test could get started running. That run was a quick turnaround, because on the same push it had already run https://tbpl.mozilla.org/php/getParsedLog.php?id=16606802&tree=Mozilla-Inbound, where it made it through verify.py and then fell asleep before the test could start running.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Seems to be running jobs. There are some red talos jobs with "failed to initialize browser", but I think those are not indicative of tegra failure?
Right; like most of our android failures, which are a very limited set of second-hand reports of something that someone heard was the thing that someone else could see through someone else's window, they are indicative of "talos tried to start up the browser, but didn't get the first thing back from the browser it expected to get." One thing to keep in mind when failure reports all look familiar is that when a code push completely breaks the browser, so that it can't even start up, I can (and sometimes do) star every failure of a completely broken browser with an existing bug. We just don't get or report all that much detail, and we have so many intermittent failures that anything that can happen probably does have a bug.

In the case of failed to initialize, bug 686085, of the fairly small percentage of them that Orange Factor has actually noticed, https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=686085&endday=2012-10-31&startday=2012-10-24&tree=all, one tegra hit it twice, one tegra (of which I'll now be suspicious) hit it five times, and this tegra hit it 14 times, half the instances of it. Assuming the instances Orange Factor bothers to notice are representative, disabling this slave should cut the instances of that hundreds-of-failures bug in half.
Just before it took out your release-mozilla-beta_tegra_android-armv6_test-mochitest-1, it took out a reftest with a bug 689856 "just stop." Of the 18 instances of that in the last week that Orange Factor knows about, 13 were this slave (and 4 were 084, the same one I said in comment 13 I'd now be suspicious of).
Ran stop_cp on it. I have no idea what to do with this one, given that it just came back from recovery. Callek?
Assignee: nobody → bugspam.Callek
(In reply to Ben Hearsum [:bhearsum] from comment #15)
> Ran stop_cp on it. I have no idea what to do with this one, given that it
> just came back from recovery. Callek?

Let's find a nearby garbage chute. Objections?
Depends on: 808437
No longer blocks: 438871
Whiteboard: [buildduty][capacity][buildslaves][orange] → [buildduty][capacity][buildslaves]
Ran ./stop_cp.sh
Blocks: 808468
Blocks: 813012
No longer blocks: 813012
Back in production
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
No jobs taken on this device for >= 7 weeks
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(mass change: filter on tegraCallek02reboot2013)

I just rebooted this device, hoping that many of the ones I'm doing tonight come back automatically. I'll check back tomorrow to see if it did; if it does not, I'll triage the next step manually on a per-device basis.

---
Command I used (with a manual patch to the fabric script to allow this command)

(fabric)[jwood@dev-master01 fabric]$  python manage_foopies.py -j15 -f devices.json `for i in 021 032 036 039 046  048 061 064 066 067 071 074 079 081 082 083 084 088 093 104 106 108 115 116 118 129 152 154 164 168 169 174 179 182 184 187 189 200 207 217 223 228 234 248 255 264 270 277 285 290 294 295 297 298 300 302 304 305 306 307 308 309 310 311 312 314 315 316 319 320 321 322 323 324 325 326 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 345 346 347 348 349 350 354 355 356 358 359 360 361 362 363 364 365 367 368 369; do echo '-D' tegra-$i; done` reboot_tegra

The command does the reboots one at a time, from the foopy each device is connected to, with one ssh connection per foopy.
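In case the backticked loop is opaque to the next person: it only expands each device number into a -D tegra-NNN argument pair before manage_foopies.py ever runs. A rough Python equivalent of that argument construction (truncated device list; invoking fabric itself is out of scope here):

# Rough equivalent of the backticked shell loop above: turn the device numbers
# into the "-D tegra-NNN" arguments that manage_foopies.py receives, then
# append the action. Only the argument construction is shown.
devices = ["021", "032", "036", "048"]  # truncated sample of the list above

args = ["python", "manage_foopies.py", "-j15", "-f", "devices.json"]
for num in devices:
    args += ["-D", "tegra-%s" % num]
args.append("reboot_tegra")

print(" ".join(args))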
Depends on: 838687
Had to cycle clientproxy and it's back up and happy.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
No longer blocks: 808468
Depends on: 808468
Hasn't run a job in 14 days, 4:37:08
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Back in production
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Ping check failing; PDU reboot didn't help.
Status: RESOLVED → REOPENED
Depends on: 944498
Resolution: FIXED → ---
Depends on: 949447
SD card replaced and reimaged/flashed.
Assignee: bugspam.Callek → nobody
Handled in the last recovery on Dec 16.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
2014-01-16 10:35:34 tegra-048 p    online   active  OFFLINE :: error.flg [Automation Error: Unable to connect to device after 5 attempts] 

PDU reboot didn't help.
Status: RESOLVED → REOPENED
Depends on: 960642
Resolution: FIXED → ---
SD card reformatted and flashed.
Assignee: nobody → pmoore
Investigation has shown that the device is running, and so is the SUT agent, but the testroot command returns 'unable to determine test root', which causes the verify.py script to fail, which in turn causes watch_devices.sh to disable the buildbot slave to prevent it from taking jobs.

[cltbld@foopy112.tegra.releng.scl3.mozilla.com sut_tools]$ telnet tegra-048 20701
Trying 10.26.85.33...
Connected to tegra-048.
Escape character is '^]'.
$>testroot
##AGENT-WARNING##  unable to determine test root
$>quit
quit
$>Connection closed by foreign host.
[cltbld@foopy112.tegra.releng.scl3.mozilla.com sut_tools]$ 

I will now investigate why the test root cannot be determined...
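For future reference, the manual telnet check above is easy to script. Here is a minimal sketch of the same probe, assuming nothing about the SUT agent beyond what the session shows (port 20701, the testroot and quit commands, and the ##AGENT-WARNING## prefix); prompt and response framing is simplified:

# Minimal sketch of the manual check above: connect to the SUT agent on
# port 20701, ask for the test root, and treat an ##AGENT-WARNING## reply
# as the same failure verify.py trips over. Uses only the commands seen in
# the telnet session (testroot, quit); everything else is assumption.
import socket

def get_testroot(host, port=20701, timeout=30):
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.recv(1024)                      # swallow the initial "$>" prompt
        conn.sendall(b"testroot\n")
        reply = conn.recv(1024).decode("utf-8", "replace").strip()
        conn.sendall(b"quit\n")
    if "##AGENT-WARNING##" in reply:
        raise RuntimeError("agent could not determine test root: %s" % reply)
    return reply

print(get_testroot("tegra-048"))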
QA Contact: armenzg → bugspam.Callek
Since it wound up reenabled and running jobs, I'll wager any evidence of why it was busted back in February is now long gone.
Assignee: pmoore → nobody
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard