Closed Bug 944814 (talos-linux32-ix-026) Opened 12 years ago Closed 10 years ago

talos-linux32-ix-026 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2167] [buildduty][buildslaves][capacity])

PING DOWN. I cannot reboot it through slaveapi.
And it is back.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 950181
Hasn't taken a job in months, needs a re-image.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It has been taking jobs after the re-image.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 992477)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 995771)
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 998359)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1041030)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1045793)
another false positive by slaveapi
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1050955)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1051623)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1053026)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1065317)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1065837)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Disabled in slavealloc - the bug is not that it needs to constantly be manually rebooted, because it does not: after each of those reboot bugs, it just continues merrily on its way. The bug is that slaverebooter can't reboot it, and we're not getting anywhere on that by rebooting a running slave over and over.
Depends on: 1066267
:callek/philor/coop, can you guys reenable this host? iX asked me to reseat the memory and cables and to reset the BIOS. I also reimaged the host. If the issue occurs again, I think we can ask for a system board replacement.
Reenabled a couple of hours ago, it made it through one reboot at least.
Made it through several reboots, actually, but it hasn't managed to connect to a master so it isn't really doing any work.
:philor, does that indicate a hardware (NIC/network) failure? Does slaverebooter first attempt to reboot by SSH and then, if that fails, try to reboot by IPMI?
I can at least answer the latter question: yes, slaverebooter first tries SSH, and if that fails tries IPMI, and as https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux32-ix&name=talos-linux32-ix-026 says, it's been failing at SSH and succeeding at IPMI since I reenabled it. As to what it means, that page is the only thing I have to go by, and all it says is that despite successful IPMI reboots (or what slaverebooter thinks are successful reboots, I also don't have any way of knowing whether or not they really are successful reboots), it isn't getting as far as starting up buildbot and connecting to a master and being given jobs to run.
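For readers unfamiliar with the escalation order described in the comment above, here is a minimal sketch of a "try SSH, then IPMI, then file a bug" chain. It is not slaverebooter's actual code: the function name `reboot_with_fallback`, the `RebootMethod` type, and the "Succeeded" log wording are assumptions made for illustration. Only the "Attempting ... reboot...Failed." and "Filed IT bug for reboot" strings come from this bug.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a reboot method takes a hostname and returns True on success.
RebootMethod = Callable[[str], bool]


def reboot_with_fallback(
    host: str, methods: List[Tuple[str, RebootMethod]]
) -> Tuple[str, List[str]]:
    """Try each reboot method in order; return (outcome, log lines).

    The outcome is the name of the first method that succeeded, or
    "escalated" if every method failed and a human has to step in.
    """
    log: List[str] = []
    for name, method in methods:
        try:
            ok = method(host)
        except Exception:
            # Treat any error from a method as a failed attempt.
            ok = False
        # "Succeeded" is assumed wording; only "Failed" appears in this bug.
        log.append(f"Attempting {name} reboot...{'Succeeded' if ok else 'Failed'}.")
        if ok:
            return name, log
    log.append("Filed IT bug for reboot")
    return "escalated", log
```

Note that a chain like this only knows what each method reports. As the comment above points out, a "successful" IPMI reboot says nothing about whether buildbot later starts and connects to a master.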
weird. i sent 10k pings to the host and had 0 packets lost. i am able to ssh into the host with no issues. looking at last|less, it looks like the host was rebooted several times yesterday by the script (user cltbld), although it's been up for the past 16 hours so i guess it's failing to reboot now?
--- talos-linux32-ix-026.test.releng.scl3.mozilla.com ping statistics ---
10000 packets transmitted, 10000 packets received, 0.0% packet loss
IPMI is also reachable. Could there be an issue with the script or permissions that allows the host to reboot? I don't see anything indicative of a hardware (nic/ipmi) failure.
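A check like the one above can be scripted by parsing the summary line that ping prints. The sketch below is an assumption-laden illustration, not a tool used in this bug: `parse_ping_summary` is a hypothetical helper, and the regular expression targets the summary format quoted in the comment (it also tolerates implementations that print "received" without "packets").

```python
import re
from typing import Optional, Tuple

# Matches e.g. "10000 packets transmitted, 10000 packets received, 0.0% packet loss".
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, "
    r"(?P<received>\d+) (?:packets )?received, "
    r"(?P<loss>[\d.]+)% packet loss"
)


def parse_ping_summary(text: str) -> Optional[Tuple[int, int, float]]:
    """Return (sent, received, loss_percent), or None if no summary line is found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return (
        int(match.group("sent")),
        int(match.group("received")),
        float(match.group("loss")),
    )
```

Zero packet loss only rules out basic network reachability problems; it does not show that the reboot script or buildbot startup is working.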
Flags: needinfo?(bugspam.Callek)
Well, it's still not working as far as buildbot is concerned: it has taken no jobs, even though the most recent slaverebooter/slaveapi attempts claim to have rebooted it (but only via IPMI).
Flags: needinfo?(bugspam.Callek)
the host is still up and I am able to ssh in as the user cltbld. is there a way to force the script to run to see where it's stalling?
I've kicked off a re-image here and will monitor it.
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1084957)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
QA Contact: armenzg → bugspam.Callek
Resolution: --- → FIXED
Disabled. As in bug 1052108 comment 5, it looks to be stuck on an old version of talos.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [buildduty][buildslaves][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2152] [buildduty][buildslaves][capacity]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2152] [buildduty][buildslaves][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2165] [buildduty][buildslaves][capacity]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2165] [buildduty][buildslaves][capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2167] [buildduty][buildslaves][capacity]
Re-imaged and returned to production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
(In reply to Chris Cooper [:coop] from comment #29)
> Re-imaged and returned to production.
...or not. Attempting to re-image, I was dropped into a GRUB menu with memtest options. Not sure whether diagnostic media is still attached to this machine or what.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1095239
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1095246)
Diagnosticsed and reimaged, rereenabled.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Three jobs so far: a green mozilla-aurora, and two trunk jobs where it crashed on startup. We apparently don't actually care about that sort of thing, just blindly starring it and retriggering it, so I'm leaving it enabled for a while to build up a more obviously broken resume.
Yeah, pretty broken alright. Killed for random talos crashes.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Not at all random, which is what makes them interesting. The same crash on all four trunk jobs it took, no crash on either non-trunk job it took. Do we have two ways of reimaging these slaves, one of which sticks them with one old version of talos which just fails, and another which sticks them with a different old version of talos which crashes?
I'm getting a "media test failure, check cable" error message when trying to PXE boot. Also, getting dropped into the same GRUB menu as before. We may need iX intervention here.
Depends on: 1107785
Back in production.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
One of two slaves hitting bug 977306. Disabled.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1137314
Re-imaged and returned to production.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Still torching talos jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1141416
Reenabled to see if this is fixed by bug 1154434.
Nope, still burning.
No longer depends on: 1141416
Re-imaged, restarted httpd and enabled in slavealloc.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Depends on: 1141416
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Re-imaged and enabled in slavealloc. Already completed several jobs with no issues.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard