Closed
Bug 746071
(tegra-048)
Opened 12 years ago
Closed 10 years ago
tegra-048 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
Details
(Whiteboard: [buildduty][capacity][buildslaves])
I didn't look further back, but the last 200 jobs it's taken have all wound up in RETRY and tears - it needs to sit in the corner and collect itself.
Comment 1•12 years ago
Ran stop_cp on it.
Updated•12 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 3•12 years ago
Went offline. Trying a PDU reboot.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 4•12 years ago
Back online.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 5•12 years ago
Last job 20 days, 19:39:58 ago. error.flg [Remote Device Error: Unable to properly remove /mnt/sdcard/tests]. Remotely reformatted the SD card.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6•12 years ago
Green jobs again.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 7•12 years ago
Please swap the SD card. error.flg [Remote Device Error: unable to write to sdcard]
Comment 8•12 years ago
SD card swapped.
Reporter
Comment 9•12 years ago
36% green over the last 300 runs.
Blocks: 438871
Whiteboard: [buildduty][capacity][buildslaves] → [buildduty][capacity][buildslaves][orange]
Comment 10•12 years ago
Has been running jobs for a while now.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Reporter
Comment 11•12 years ago
Yes, it has been running jobs for a while now. Please, please make it stop running jobs.

Dunno if we have a better report, but according to https://secure.pub.build.mozilla.org/buildapi/reports/slaves (once you "Show all entries" and sort by Slave Name to put the tegras together, because none of the other controls work), a non-broken tegra runs at better than 70% green, up to 100% green. This one has been under 50% green for its last 50 jobs, its last 100 jobs, and its last 500 jobs.

Randomly choosing from the failures currently under my nose: in https://tbpl.mozilla.org/php/getParsedLog.php?id=16611152&tree=Mozilla-Inbound it barely manages to stay awake long enough to get past verify.py, then falls asleep at the start of the test run. https://tbpl.mozilla.org/php/getParsedLog.php?id=16607708&tree=Mozilla-Inbound was a fairly normal failure (I didn't even notice it was 048), except that it's a "failed to initialize browser" of the sort where it managed to stay awake through verify.py and then fell asleep before the test could get started. That run was a quick turnaround, because on the same push it had already run https://tbpl.mozilla.org/php/getParsedLog.php?id=16606802&tree=Mozilla-Inbound, where it made it through verify.py and then fell asleep before the test could start running.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 12•12 years ago
Seems to be running jobs. There are some red talos jobs with "failed to initialize browser", but I think those are not indicative of tegra failure?
Reporter
Comment 13•12 years ago
Right. Like most of our android failures, which are a very limited set of second-hand reports of something that someone heard was the thing that someone else could see through someone else's window, they are indicative of "talos tried to start up the browser, but didn't get the first thing back from the browser it expected to get."

One thing to keep in mind when failure reports all look familiar is that when a code push completely breaks the browser, so that it can't even start up, I can (and sometimes do) star every failure of that completely broken browser with an existing bug. We just don't get or report all that much detail, and we have so many intermittent failures that anything which can happen probably does have a bug.

In the case of failed to initialize, bug 686085, of the fairly small percentage of instances that Orange Factor has actually noticed (https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=686085&endday=2012-10-31&startday=2012-10-24&tree=all), one tegra hit it twice, one tegra (of which I'll now be suspicious) hit it five times, and this tegra hit it 14 times, half the instances. Assuming the instances Orange Factor bothers to notice are representative, disabling this slave should cut the instances of that hundreds-of-failures bug in half.
Reporter
Comment 14•12 years ago
Just before it took out your release-mozilla-beta_tegra_android-armv6_test-mochitest-1, it took out a reftest with a bug 689856 "just stop." Of the 18 instances of that in the last week that Orange Factor knows about, 13 were this slave (and 4 were 084, the same one I said in comment 13 I'd now be suspicious of).
Comment 15•12 years ago
Ran stop_cp on it. I have no idea what to do with this one, given that it just came back from recovery. Callek?
Assignee: nobody → bugspam.Callek
Comment 16•12 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #15)
> Ran stop_cp on it. I have no idea what to do with this one, given that it
> just came back from recovery. Callek?

Let's find a nearby garbage chute. Objections?
Reporter
Updated•12 years ago
No longer blocks: 438871
Whiteboard: [buildduty][capacity][buildslaves][orange] → [buildduty][capacity][buildslaves]
Updated•12 years ago
Comment 18•12 years ago
Back in production
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 19•11 years ago
No jobs taken on this device for >= 7 weeks
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 20•11 years ago
(mass change: filter on tegraCallek02reboot2013)

I just rebooted this device, hoping that many of the ones I'm doing tonight come back automatically. I'll check back in tomorrow to see if it did; if it does not, I'll triage the next step manually on a per-device basis.

---

Command I used (with a manual patch to the fabric script to allow this command):

(fabric)[jwood@dev-master01 fabric]$ python manage_foopies.py -j15 -f devices.json `for i in 021 032 036 039 046 048 061 064 066 067 071 074 079 081 082 083 084 088 093 104 106 108 115 116 118 129 152 154 164 168 169 174 179 182 184 187 189 200 207 217 223 228 234 248 255 264 270 277 285 290 294 295 297 298 300 302 304 305 306 307 308 309 310 311 312 314 315 316 319 320 321 322 323 324 325 326 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 345 346 347 348 349 350 354 355 356 358 359 360 361 362 363 364 365 367 368 369; do echo '-D' tegra-$i; done` reboot_tegra

The command does the reboot one device at a time, from the foopy each device is connected to, with one ssh connection per foopy.
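For clarity, a minimal sketch of what that backtick loop expands to, assuming only that manage_foopies.py accepts repeated "-D tegra-NNN" arguments followed by the reboot_tegra action as shown above; the shortened device list and the subprocess invocation are illustrative, not the actual tooling:

    import subprocess

    # Illustrative subset of the tegras rebooted in this pass; the real run
    # listed ~110 device numbers (see the full command above).
    devices = ["021", "032", "036", "039", "046", "048"]

    # Build the same argument list the backtick for-loop expands to:
    #   -D tegra-021 -D tegra-032 ... reboot_tegra
    args = ["python", "manage_foopies.py", "-j15", "-f", "devices.json"]
    for num in devices:
        args += ["-D", "tegra-" + num]
    args.append("reboot_tegra")

    subprocess.check_call(args)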
Comment 21•11 years ago
Had to cycle clientproxy and it's back up and happy.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Comment 22•11 years ago
Hasn't run a job in 14 days, 4:37:08
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 23•11 years ago
Back in production
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 24•11 years ago
Ping check failing; PDU reboot didn't help.
Comment 25•11 years ago
SD card replaced and reimaged/flashed.
Updated•10 years ago
Assignee: bugspam.Callek → nobody
Comment 26•10 years ago
Handled in the last recovery on Dec 16.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Comment 27•10 years ago
2014-01-16 10:35:34 tegra-048 p online active OFFLINE :: error.flg [Automation Error: Unable to connect to device after 5 attempts]

PDU reboot didn't help.
Comment 28•10 years ago
SD card reformatted and flashed.
Updated•10 years ago
Assignee: nobody → pmoore
Comment 29•10 years ago
Investigation has shown that the device is running, and so is the SUT agent, but the testroot command returns 'unable to determine test root', which causes the verify.py script to fail, which causes watch_devices.sh to disable the buildbot slave to prevent it from taking jobs.

[cltbld@foopy112.tegra.releng.scl3.mozilla.com sut_tools]$ telnet tegra-048 20701
Trying 10.26.85.33...
Connected to tegra-048.
Escape character is '^]'.
$>testroot
##AGENT-WARNING## unable to determine test root
$>quit
quit
$>Connection closed by foreign host.
[cltbld@foopy112.tegra.releng.scl3.mozilla.com sut_tools]$

I will now investigate why the test root cannot be determined...
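For reference, a minimal sketch of that manual check done from a script instead of telnet, assuming only what the transcript shows (the SUT agent listens on TCP port 20701, prints a "$>" prompt, and answers a plain-text "testroot" command); the host name, timeout, and buffer sizes are illustrative:

    import socket

    HOST = "tegra-048"  # illustrative; the device under investigation above
    PORT = 20701        # SUT agent port shown in the telnet transcript

    # Open a plain TCP connection, just as telnet does.
    sock = socket.create_connection((HOST, PORT), timeout=30)
    sock.settimeout(30)

    # Read the initial "$>" prompt, then ask the agent for its test root.
    sock.recv(1024)
    sock.sendall(b"testroot\n")
    reply = sock.recv(1024).decode("utf-8", "replace")

    # A healthy agent answers with a path; a broken one answers
    # "##AGENT-WARNING## unable to determine test root".
    if "##AGENT-WARNING##" in reply:
        print("testroot failed:", reply.strip())
    else:
        print("testroot:", reply.strip())

    sock.sendall(b"quit\n")
    sock.close()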
Reporter
Updated•10 years ago
QA Contact: armenzg → bugspam.Callek
Reporter
Comment 30•10 years ago
Since it wound up reenabled and running jobs, I'll wager any evidence of why it was busted back in February is now long gone.
Assignee: pmoore → nobody
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard