Closed
Bug 785655
Opened 13 years ago
Closed 12 years ago
foopy06.build.mtv1 is DOWN
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Unassigned)
Details
foopy06 (a staging machine) just started alerting in Nagios.
It is used frequently to stage mobile-automation changes, so bringing it up in a timely manner is very helpful, but it's not an urgent "go to DC on a Saturday/Sunday" issue.
While we do want it up, even more important is trying to figure out WHY it went down, so please look into that and document what happened here.
Thanks
[75] foopy06.build.mtv1 is DOWN: PING CRITICAL - Packet loss = 100%
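For anyone reproducing the alert by hand, here is a minimal sketch that reads a ping summary on stdin and reports the host state. The function names packet_loss and host_state are made up for illustration; this is not the actual Nagios check.

```shell
#!/bin/sh
# Extract the packet-loss percentage from a ping summary read on stdin.
# Matches a summary line such as:
#   5 packets transmitted, 5 packets received, 0.0% packet loss
packet_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Print DOWN when loss is 100%, UNKNOWN when no summary line was found,
# and UP otherwise (partial loss still counts as UP in this sketch).
host_state() {
    loss=$(packet_loss)
    case "$loss" in
        100|100.0) echo "DOWN" ;;
        "")        echo "UNKNOWN" ;;
        *)         echo "UP" ;;
    esac
}
```

Usage would be something like `ping -q -c 5 foopy06.build.mtv1.mozilla.com | host_state`.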
| Reporter | ||
Updated•13 years ago
Severity: normal → major
Comment 1•13 years ago
Dropping to normal to not page oncall. Since it is a Mac mini, it will need someone to be onsite. I'll be sure to poke #dcops to get it attention Monday morning.
Severity: major → normal
Updated•13 years ago
colo-trip: --- → mtv1
Comment 2•13 years ago
There's really no way to get a crash cart or plug a monitor into these foopy machines the way they're set up in 2-idf while they're still racked. I can take a picture if you want but I found the host powered on. I brought it upstairs and plugged it into a keyboard/monitor and it booted up fine. I have reracked the host and it is pingable/sshable.
vle@host-6-245 ~ $ sudo fping foopy06.build.mtv1.mozilla.com
foopy06.build.mtv1.mozilla.com is alive
vle@host-6-245 ~ $ ssh foopy06.build.mtv1.mozilla.com
The authenticity of host 'foopy06.build.mtv1.mozilla.com (10.250.48.200)' can't be established.
RSA key fingerprint is b6:4f:f7:6b:e6:c4:b6:97:9c:ca:89:a7:70:24:3b:34.
Are you sure you want to continue connecting (yes/no)?
Let me know if issues persist.
Thanks,
Van
Assignee: server-ops → vle
Comment 3•13 years ago
There are a lot of logs about crash reports and low disk space in /var/log/system.log. Not sure if the disk space issues caused the crashes, but cleaning off some stuff might be worthwhile, releng folks.
Comment 4•13 years ago
The disk space usage on the machine seemed way out of whack. df showed all but 2G used, but du didn't agree. bhearsum tried to reboot it and it's pingable but ssh doesn't respond. Can we run some diags on it to see if the hardware is toast?
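A rough sketch of how the df/du mismatch could be quantified. The helper names are hypothetical, and the flags assume a df that accepts -kP and a du that accepts -skx (both macOS and GNU versions do). A large gap usually points at space du cannot see, such as deleted-but-open files, directories unreadable by the current user, or filesystem damage.

```shell
#!/bin/sh
# Used kilobytes on the filesystem containing $1, as reported by df.
# -P forces the one-line-per-filesystem POSIX layout, so column 3 is "Used".
df_used_kb() {
    df -kP "$1" | awk 'NR == 2 { print $3 }'
}

# Kilobytes found by walking the tree at $1 without crossing mount points.
# Run as root, otherwise unreadable directories are silently undercounted.
du_used_kb() {
    du -skx "$1" 2>/dev/null | awk '{ print $1 }'
}

# Difference in KB between the df figure ($1) and the du figure ($2).
usage_gap_kb() {
    echo $(( $1 - $2 ))
}
```

For example, `usage_gap_kb "$(df_used_kb /)" "$(du_used_kb /)"` prints how many kilobytes df counts that du does not.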
Comment 5•13 years ago
Desktop ran hardware diagnostics and everything checked out OK. Do you want to try to reimage the host or should we decommission it?
Thanks,
Van
| Reporter | ||
Comment 6•13 years ago
(In reply to Van Le [:van] from comment #5)
> Desktop ran hardware diagnostics and everything checked out OK. Do you want
> to try to reimage the host or should we decommission it?
>
If our only choices are reimage or decomm, I vote decomm. The work required to set it back up from scratch is not worth it.
After hardware diag, does the info arr said in c#4 still apply?
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> The disk space usage on the machine seemed way out of whack. df showed all
> but 2G used, but du didn't agree. bhearsum tried to reboot it and it's
> pingable but ssh doesn't respond.
Comment 7•13 years ago
Well, if there are no hardware issues, I can take another crack at it once it's back up again.
Comment 8•13 years ago
I ran a verify on the disk and it found corruption. Could someone please boot up off of a recovery disk and run repair on the disk?
Comment 9•13 years ago
I ran a repair disk on foopy06 and it fixed the hard drive errors. Ran the verify afterwards and no errors were detected. Machine is now plugged in at 2.IDF location. Please verify that you can access it.
-Vinh
| Reporter | ||
Comment 10•13 years ago
Looks up to me, running ./start_cp.sh on it (after I cleared the stale .pid files)
Will reopen if it acts up in the next few hours, or file a new bug if it acts up in a few weeks down the road.
Thanks
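For reference, a minimal sketch of clearing stale .pid files before restarting. The directory argument and the function name are assumptions; the actual location of the .pid files on the foopies is not stated in this bug.

```shell
#!/bin/sh
# Remove *.pid files in directory $1 whose recorded process no longer exists.
# Caveat: kill -0 also fails for a live process owned by another user,
# so run this as the user that owns the processes.
clear_stale_pids() {
    for f in "$1"/*.pid; do
        [ -f "$f" ] || continue
        pid=$(cat "$f")
        # Skip files that do not contain a plain numeric PID.
        case "$pid" in
            ''|*[!0-9]*) continue ;;
        esac
        if ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$f"
            echo "removed stale $f (pid $pid)"
        fi
    done
}
```

Pid files that point at a running process are left alone, so the function is safe to run before ./start_cp.sh even if some processes survived the outage.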
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 11•13 years ago
With the disk corruption fixed, I was able to see that the host was now showing 100% disk usage and that it was coming from /.Spotlight-V100.
It looks like whoever set up this host did not disable indexing, so it ate 56G of space indexing who knows what. I've turned off indexing (mdutil -a -i off) and we're now back down to 26% used.
As a preventative measure, I also ran a for loop across all other foopies to disable spotlight on them as well.
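The loop was along these lines. This is a sketch, not the exact command that was run: the host list, the use of sudo over ssh, and the DRY_RUN switch are assumptions added for illustration. Only `mdutil -a -i off` is taken from the comment above.

```shell
#!/bin/sh
# Disable Spotlight indexing on every host given as an argument.
# With DRY_RUN=1 (the default here) the commands are only printed.
disable_spotlight() {
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh $host sudo mdutil -a -i off"
        else
            # mdutil -a applies to all volumes; -i off turns indexing off.
            ssh "$host" 'sudo mdutil -a -i off'
        fi
    done
}
```

Calling `disable_spotlight foopy07.build.mtv1.mozilla.com` (a hypothetical host name) prints the command for review; setting DRY_RUN=0 runs it for real.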
Assignee: vle → arich
Component: Server Operations: DCOps → Server Operations: RelEng
QA Contact: dmoore → arich
| Reporter | ||
Comment 12•12 years ago
Since this bug has a lot of context, I'm reopening rather than filing anew:
[22:27:41] nagios-releng Thu 19:28:16 PST [402] foopy06.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Can we get insight as to what happened, disk corruption again/etc?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
Assignee: arich → server-ops-dcops
Component: Server Operations: RelEng → Server Operations: DCOps
QA Contact: arich → dmoore
Comment 13•12 years ago
Power light was off.
It's up now:
LovelyMacBookAir:~ dumitrugherman$ ping -q -c 5 foopy06.build.mtv1.mozilla.com
PING foopy06.build.mtv1.mozilla.com (10.250.48.200): 56 data bytes
--- foopy06.build.mtv1.mozilla.com ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.485/2.375/3.399/0.815 ms
LovelyMacBookAir:~ dumitrugherman$ c foopy06.build.mtv1.mozilla.com
The authenticity of host 'foopy06.build.mtv1.mozilla.com (10.250.48.200)' can't be established.
RSA key fingerprint is b6:4f:f7:6b:e6:c4:b6:97:9c:ca:89:a7:70:24:3b:34.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'foopy06.build.mtv1.mozilla.com,10.250.48.200' (RSA) to the list of known hosts.
Password:
Assignee: server-ops-dcops → dgherman
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Comment 14•12 years ago
This outage is unrelated to the reason this bug was initially opened. That issue was fixed when this bug was closed.
I checked the logs and there was nothing obvious.
I ran disk utility, and both need to be repaired, but that's quite possibly a result of disk corruption during an unplanned shutdown, not a symptom of the reason it shut down.
Please boot it from CD and run Disk Utility to fix them, and then run a hardware diagnostic to see if there are any other issues.
Updated•12 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
Assignee: dgherman → server-ops-dcops
Comment 15•12 years ago
Hardware diagnostics did not detect any hardware issues except for some corrupted files. Also ran disk utility from boot up to fix disk issues.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations