Closed Bug 601123 Opened 15 years ago Closed 15 years ago

please run hardware diagnostics on linux-ix-slave17

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [fixed by IX (drive with bad sectors), will return to SCL][buildduty])

This machine repeatedly got into uninterruptible sleep while doing disk-heavy operations and required rebooting.
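For anyone triaging a similar hang, a minimal sketch of how processes stuck in uninterruptible sleep (state "D") could be listed on Linux by reading /proc/<pid>/stat. The function names parse_state and d_state_processes are hypothetical helpers, not part of any RelEng tooling, and this is not what was run on this machine.

```python
import os


def parse_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_state(fh.read())
        except OSError:
            continue  # the process exited between listdir() and open()
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

A process that stays in this list across repeated runs during disk-heavy work would point at blocked I/O, which is consistent with a failing drive or cable but does not prove it.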
Assignee: server-ops → jlazaro
Sent an email to IX support asking them to take a look at this and advise.
Assignee: jlazaro → server-ops
Assignee: server-ops → jlazaro
Just handed this machine (asset tag #4636) to Ramsey of IX Systems, with the root password reset, so that they can debug this issue further.
Looks like they've found a bad disk on this machine. Will update as soon as I find out more.
Got this back yesterday. Here's an update from Matt Finney of IX regarding this machine:
--
Hello, The system has been repaired. Originally the drive was disconnecting and reconnecting to SATA repeatedly while idle. I ran Seagate diagnostics on the drive, which found 2 bad sectors that I repaired. I replaced the SATA cable as well. I'm not able to reproduce any disk issues after those changes.
--
Will bring it back on the next trip to Internap.
Flags: colo-trip+
Whiteboard: [fixed by IX (drive with bad sectors), will return to SCL]
Status: NEW → ASSIGNED
Thanks! I guess the overall conclusion with these machines is that we'll treat issues on a case-by-case basis? E.g., IX Systems has not found anything that leads them to believe there's a larger issue at work?
Assignee: jlazaro → jdow
Brought machine to Internap and it is online now. Note: This machine was originally at Castro, so RelEng will need to update it to work at Internap.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Please put the root password back to the RelEng value.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Password reset to the current RelEng value.
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
I'm not sure this is fixed. It was pretty slow at cleaning up files from disk, so I started an fsck on /builds and 20 minutes later it's still on Pass 1. I'll let it run a bit more ...
Took 35 minutes on a fairly empty 168G partition, which seems a bit slow. More info in bug 611128.
Yeah, definitely still some disk error on this; see bug 611128 comment #4 for details on a 12x slowdown. We could try re-imaging the machine if it's a data error, or do more diagnostics to look for a busted drive. Thoughts?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Is this out of production, such that I can re-image at will?
ping?
I believe it's down. I can't ssh to it in any case, and it hasn't done any work since October.
I re-imaged it. It is currently stuck on trying to start puppet. Not sure how to get around that. Please check it out and see if things are better or not. The machine is reachable over IPMI at 10.12.48.253.
Grabbing the bug to look at its puppet config and see if we can get it back online Monday.
Assignee: jdow → bear
Whiteboard: [fixed by IX (drive with bad sectors), will return to SCL] → [fixed by IX (drive with bad sectors), will return to SCL][buildduty]
What is the status on this? If the box is back up, can we close this bug?
Assignee: bear → server-ops-releng
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
Filed bug 636827 for the post-imaging work by releng.
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations