Closed
Bug 944814
(talos-linux32-ix-026)
Opened 12 years ago
Closed 10 years ago
talos-linux32-ix-026 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2167] [buildduty][buildslaves][capacity])
PING DOWN. I cannot reboot it through slaveapi.
| Reporter | ||
Comment 1•12 years ago
And it is back.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 2•12 years ago
Hasn't taken a job in months, needs a re-image.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Reporter | ||
Comment 3•12 years ago
It has been taking jobs after the re-image.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 4•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 992477)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 5•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 995771)
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Comment 6•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 998359)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 7•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1041030)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 8•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1045793)
Comment 9•11 years ago
Another false positive by slaveapi.
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 10•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1050955)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 11•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1051623)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 12•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1053026)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•11 years ago
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 14•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1065317)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 15•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1065837)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 16•11 years ago
Disabled in slavealloc - the bug is not that it needs to constantly be manually rebooted, because it does not: after each of those reboot bugs, it just continues merrily on its way. The bug is that slaverebooter can't reboot it, and we're not getting anywhere on that by rebooting a running slave over and over.
Comment 17•11 years ago
:callek/philor/coop, can you guys reenable this host? iX asked me to reseat the memory and cables and reset the BIOS. I also reimaged the host. If the issue occurs again, I think we can ask for a system board replacement.
Comment 18•11 years ago
Reenabled a couple of hours ago, it made it through one reboot at least.
Comment 19•11 years ago
Made it through several reboots, actually, but it hasn't managed to connect to a master so it isn't really doing any work.
Comment 20•11 years ago
:philor, does that indicate a hardware (NIC/network) failure? Does it first attempt to reboot by SSH and then, if that fails, try to reboot by IPMI?
Comment 21•11 years ago
I can at least answer the latter question: yes, slaverebooter first tries SSH, and if that fails tries IPMI, and as https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux32-ix&name=talos-linux32-ix-026 says, it's been failing at SSH and succeeding at IPMI since I reenabled it.
As to what it means, that page is the only thing I have to go by, and all it says is that despite successful IPMI reboots (or what slaverebooter thinks are successful reboots; I have no way of knowing whether they really are), it isn't getting as far as starting up buildbot, connecting to a master, and being given jobs to run.
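For readers following along, the escalation order described in this comment (SSH first, then IPMI, then a manual reboot request) can be sketched roughly as below. This is not slaverebooter's actual code: `reboot_with_fallback`, the `RebootMethod` type, and the stand-in `fake_ssh`/`fake_ipmi` functions are hypothetical names for illustration, and the "Success" wording is an assumption, since only the "Failed." log lines appear in this bug.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a reboot method takes a hostname and returns True on success.
RebootMethod = Callable[[str], bool]


def reboot_with_fallback(
    host: str,
    methods: List[Tuple[str, RebootMethod]],
) -> Tuple[Optional[str], List[str]]:
    """Try each reboot method in order and stop at the first one that works.

    Returns the name of the method that succeeded (or None if all failed),
    plus log lines in the same shape as the automated comments in this bug.
    """
    log: List[str] = []
    for name, method in methods:
        try:
            ok = method(host)
        except Exception:
            # Treat any error raised by a method as a failed attempt.
            ok = False
        # "Success" is an assumed wording; the bug only shows "Failed."
        log.append(f"Attempting {name} reboot...{'Success' if ok else 'Failed'}.")
        if ok:
            return name, log
    # Nothing worked: this is the point where a manual reboot bug would be filed.
    log.append("All automated methods failed; manual reboot request needed.")
    return None, log


# Stand-in methods for illustration only; real ones would talk to the host.
def fake_ssh(host: str) -> bool:
    return False


def fake_ipmi(host: str) -> bool:
    return True


winner, log = reboot_with_fallback(
    "talos-linux32-ix-026", [("SSH", fake_ssh), ("IPMI", fake_ipmi)]
)
```

Note that a chain like this only reports whether the reboot command was accepted. As the comment points out, it says nothing about whether the machine then came back up and connected to a master.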
Comment 22•11 years ago
Weird. I sent 10k pings to the host and had 0 packets lost. I am able to ssh into the host with no issues. Looking at last|less, it looks like the host was rebooted several times yesterday by the script (user cltbld), although it's been up for the past 16 hours, so I guess it's failing to reboot now?
--- talos-linux32-ix-026.test.releng.scl3.mozilla.com ping statistics ---
10000 packets transmitted, 10000 packets received, 0.0% packet loss
IPMI is also reachable. Could there be an issue with the script, or with the permissions that allow the host to reboot? I don't see anything indicative of a hardware (NIC/IPMI) failure.
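If a check like the ping test above were to be automated, the summary line could be parsed along these lines. This is only a sketch: `PingSummary` and `parse_ping_summary` are hypothetical names, and the pattern covers just the summary format quoted in this comment (with "packets" before "received" treated as optional), not every ping implementation's output.

```python
import re
from typing import NamedTuple, Optional


class PingSummary(NamedTuple):
    transmitted: int
    received: int
    loss_percent: float


# Matches the summary line quoted above; the second "packets" is optional
# because some ping implementations omit it.
_SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received, ([\d.]+)% packet loss"
)


def parse_ping_summary(output: str) -> Optional[PingSummary]:
    """Extract the counters from ping output, or return None if no summary is found."""
    match = _SUMMARY_RE.search(output)
    if match is None:
        return None
    return PingSummary(
        transmitted=int(match.group(1)),
        received=int(match.group(2)),
        loss_percent=float(match.group(3)),
    )


sample = "10000 packets transmitted, 10000 packets received, 0.0% packet loss"
summary = parse_ping_summary(sample)
```

A clean result here only rules out basic network loss; it does not explain why the SSH reboot attempts were being reported as failures.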
Updated•11 years ago
Flags: needinfo?(bugspam.Callek)
Comment 23•11 years ago
Well, it's still not successfully working as far as buildbot is concerned (no jobs, despite the most recent slaverebooter/slaveapi attempts being able to reboot it, though only via IPMI, it claims).
Flags: needinfo?(bugspam.Callek)
Comment 24•11 years ago
The host is still up and I am able to ssh in as the user cltbld. Is there a way to force the script to run to see where it's stalling?
Comment 25•11 years ago
I've kicked off a re-image here and will monitor it.
Comment 26•11 years ago
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 27•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1084957)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
QA Contact: armenzg → bugspam.Callek
Resolution: --- → FIXED
Comment 28•11 years ago
Disabled. As in bug 1052108 comment 5, it looks to be stuck on an old version of talos.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•11 years ago
Whiteboard: [buildduty][buildslaves][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2152] [buildduty][buildslaves][capacity]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2152] [buildduty][buildslaves][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2165] [buildduty][buildslaves][capacity]
Updated•11 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2165] [buildduty][buildslaves][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2167] [buildduty][buildslaves][capacity]
Comment 29•11 years ago
Re-imaged and returned to production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 30•11 years ago
(In reply to Chris Cooper [:coop] from comment #29)
> Re-imaged and returned to production.
...or not. Attempting to re-image, I was dropped into a GRUB menu with memtest options. Not sure whether diagnostic media is still attached to this machine or what.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 31•11 years ago
Attempting SSH reboot...Failed.
Attempting IPMI reboot...Failed.
Filed IT bug for reboot (bug 1095246)
Comment 32•11 years ago
Diagnosticsed and reimaged, rereenabled.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 33•11 years ago
Three jobs so far: a green mozilla-aurora, and two trunk jobs where it crashed on startup. We apparently don't actually care about that sort of thing, just blindly starring it and retriggering it, so I'm leaving it enabled for a while to build up a more obviously broken resume.
Comment 34•11 years ago
Yeah, pretty broken alright. Killed for random talos crashes.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 35•11 years ago
Not at all random, which is what makes them interesting. The same crash on all four trunk jobs it took, no crash on either non-trunk job it took. Do we have two ways of reimaging these slaves, one of which sticks them with one old version of talos which just fails, and another which sticks them with a different old version of talos which crashes?
Comment 36•11 years ago
I'm getting a "media test failure, check cable" error message when trying to PXE boot. Also, getting dropped into the same GRUB menu as before. We may need iX intervention here.
Comment 37•10 years ago
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Comment 38•10 years ago
One of two slaves hitting bug 977306. Disabled.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 39•10 years ago
Re-imaged and returned to production.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 40•10 years ago
Still torching talos jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 41•10 years ago
Reenabled to see if this is fixed by bug 1154434.
Comment 42•10 years ago
Nope, still burning.
Comment 43•10 years ago
Re-imaged, restarted httpd and enabled in slavealloc.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 44•10 years ago
Burning every talos job it touches.
https://treeherder.mozilla.org/logviewer.html#?job_id=12841825&repo=mozilla-inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 45•10 years ago
Re-imaged and enabled in slavealloc. Already completed several jobs with no issues.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard