Closed Bug 753257 (tegra-049) Opened 13 years ago Closed 13 years ago

tegra-049 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)

References

()

Details

(Whiteboard: [buildduty][buildslave][capacity])

So tegra-049 needs some TLC. I was going through last-job-per-slave and saw this was over 2 weeks old, pingable but not telnetable (over 20701 like normal), so I killed cp, powercycled it, then brought it up. After it was up, tailing the run of cp itself, I saw another obvious problem, which is likely a code problem:

2012-05-09 00:18:41,660 INFO MainProcess: Running verify code
2012-05-09 00:18:41,660 DEBUG MainProcess: calling [python /builds/sut_tools/verify.py tegra-049]
2012-05-09 00:24:13,128 DEBUG dialback: error during dialback loop
2012-05-09 00:24:13,130 DEBUG dialback: Traceback (most recent call last):
2012-05-09 00:24:13,131 DEBUG dialback:
2012-05-09 00:24:13,131 DEBUG dialback:   File "clientproxy.py", line 219, in handleDialback
2012-05-09 00:24:13,131 DEBUG dialback:     asyncore.loop(timeout=1, count=1)
2012-05-09 00:24:13,131 DEBUG dialback:
2012-05-09 00:24:13,131 DEBUG dialback:   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/asyncore.py", line 214, in loop
2012-05-09 00:24:13,131 DEBUG dialback:     poll_fun(timeout, map)
2012-05-09 00:24:13,132 DEBUG dialback:
2012-05-09 00:24:13,132 DEBUG dialback:   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/asyncore.py", line 140, in poll
2012-05-09 00:24:13,132 DEBUG dialback:     r, w, e = select.select(r, w, e, timeout)
2012-05-09 00:24:13,132 DEBUG dialback:
2012-05-09 00:24:13,132 DEBUG dialback:   File "clientproxy.py", line 456, in handleSigTERM
2012-05-09 00:24:13,132 DEBUG dialback:     db.close()
2012-05-09 00:24:13,133 DEBUG dialback:
2012-05-09 00:24:13,133 DEBUG dialback: AttributeError: 'Process' object has no attribute 'close'
2012-05-09 00:24:13,133 DEBUG dialback:
2012-05-09 00:24:13,133 DEBUG dialback: Traceback End
2012-05-09 00:24:13,133 INFO dialback: process shutting down
2012-05-09 00:24:13,133 DEBUG dialback: running all "atexit" finalizers with priority >= 0
2012-05-09 00:24:13,134 DEBUG dialback: running the remaining "atexit" finalizers
2012-05-09 00:24:13,134 INFO dialback: process exiting with exitcode 0
^C
bash-3.2$

So it lost the connection to tegra-049 shortly after it was brought up from a powercycle, then hurt itself with a code error it can't recover from. Ran stop_cp.sh and moved it to offline.
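For context, multiprocessing.Process has no close() method in Python 2.6 (it only gained one much later, in Python 3.7), which matches the AttributeError above where handleSigTERM calls db.close() on the dialback Process. A minimal sketch of a more defensive shutdown, assuming db is that Process object (the helper name shutdown_dialback is hypothetical, not the actual clientproxy.py code):

    import multiprocessing

    def shutdown_dialback(db, timeout=5):
        """Stop the dialback child process without assuming it has close()."""
        if db is None:
            return
        if db.is_alive():
            db.terminate()   # signal the child to exit
        db.join(timeout)     # reap it so we don't leave a zombie behind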
No longer blocks: 752954
Depends on: 752954
Depends on: 754406
So it's back up and pingable, BUT we still cannot telnet in to it. It was reimaged in Bug 752954 but that did not help; filed Bug 754406 to have it pulled for ateam to investigate.
Whiteboard: [buildslave][capacity]
I ran mochitests 5-6 times and rebooted it 4 times in succession. It came back up every time. It does take a little while (20-30s) to come back from a reboot, so perhaps this is a race condition in the clientproxy code where clientproxy thinks it has rebooted but it is still starting up. At any rate, I can't find anything to fix on this one. Going to put it on Jake's desk with the others.
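If it really is a startup race, one way to paper over it (a sketch only, not the actual clientproxy logic; wait_for_sut is a hypothetical helper, and 20701 is the SUT agent port mentioned in comment 0) is to poll the agent port for a grace period before declaring the device down:

    import socket
    import time

    def wait_for_sut(host, port=20701, timeout=120, interval=5):
        """Poll the SUT agent port until it accepts a connection or we give up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                s = socket.create_connection((host, port), timeout=interval)
                s.close()
                return True   # device answered; reboot has finished
            except socket.error:
                time.sleep(interval)   # still rebooting, try again shortly
        return False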
Whiteboard: [buildslave][capacity] → [buildduty][buildslave][capacity]
Jake - what's up with this tegra?
Assignee: nobody → jwatkins
I've flashed/formatted/reinitialized it and put it back into its rack space.
What do we do with this Tegra? It seems to be back, but it's not listed in the dashboard, so I don't know how to bring it up.
Assignee: jwatkins → bear
Added to foopy08, but it's down and not rescuable via PDU reboots.
Assignee: bear → nobody
Depends on: 770519
(In reply to Aki Sasaki [:aki] from comment #6)
> Added to foopy08, but it's down and not rescuable via pdu reboots.

Ummm, can you take it back off here, unless there are in fact only 11 tegras on foopy08 including this one?

$ python tegras_per_foopy.py
PRODUCTION:
foopy07 contains 11 tegras
foopy08 contains 11 tegras
foopy09 contains 11 tegras
foopy10 contains 11 tegras
foopy11 contains 11 tegras
foopy12 contains 13 tegras
foopy13 contains 13 tegras
foopy14 contains 13 tegras
foopy15 contains 13 tegras
foopy16 contains 13 tegras
foopy17 contains 13 tegras
foopy18 contains 13 tegras
foopy19 contains 13 tegras
foopy20 contains 13 tegras
foopy22 contains 13 tegras
foopy23 contains 13 tegras
foopy24 contains 13 tegras
We have 211 tegras in 17 foopies which means a ratio of 12 tegras per foopy

Either way, wherever you put it, it needs a tegras.json update from build-tools/buildfarm/mobile. Foopy08, from investigation, can't reliably handle more than 11 tegras.
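For reference, a rough sketch of how a count like the one above can be produced (the real tegras_per_foopy.py lives in build-tools and may differ; this assumes tegras.json maps each tegra name to a dict with a "foopy" key):

    import json
    from collections import defaultdict

    def count_tegras(path="tegras.json"):
        with open(path) as f:
            tegras = json.load(f)
        per_foopy = defaultdict(int)
        for name, info in tegras.items():
            foopy = info.get("foopy")
            if foopy and foopy != "None":
                per_foopy[foopy] += 1
        for foopy in sorted(per_foopy):
            print("%s contains %d tegras" % (foopy, per_foopy[foopy]))
        total = sum(per_foopy.values())
        print("We have %d tegras in %d foopies" % (total, len(per_foopy)))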
Assignee: nobody → aki
Assignee: aki → bugspam.Callek
Removed, still needs rescuing.
We have flashed and reimaged the Tegra board, but it is still not coming up. We also tried a different switch port. It appears that the board is bad. - Van
Let's decommission it then. I'll file bugs to get monitoring updated.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
So, DCOps claims this tegra is still happily attached to a PDU, but it is already removed from DNS. We need to finish the decomm story here.
Flags: needinfo?(bugspam.Callek)
Product: mozilla.org → Release Engineering
Flags: needinfo?(bugspam.Callek)
Assignee: bugspam.Callek → nobody
QA Contact: armenzg → bugspam.Callek
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard