foopy06.build.mtv1 is DOWN

RESOLVED FIXED

Status

RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: Callek, Unassigned)

(Reporter)

Description

6 years ago
foopy06 (a staging machine) just started alerting in Nagios.

It is used frequently to stage mobile-automation changes, so bringing it up in a timely manner is very helpful, but it's not an urgent "go to DC on a Saturday/Sunday" issue.

While we do want it up, even more important is trying to figure out WHY it went down, so please look into that and document what happened here.

Thanks

	[75] foopy06.build.mtv1 is DOWN: PING CRITICAL - Packet loss = 100%
(Reporter)

Updated

6 years ago
Severity: normal → major
Dropping to normal to not page oncall. Since it is a Mac mini, it will need someone to be onsite. I'll be sure to poke #dcops to get it attention Monday morning.
Severity: major → normal

Updated

6 years ago
colo-trip: --- → mtv1

Comment 2

6 years ago
There's really no way to get a crash cart or plug a monitor into these foopy machines the way they're set up in 2-idf while they're still racked. I can take a picture if you want but I found the host powered on. I brought it upstairs and plugged it into a keyboard/monitor and it booted up fine. I have reracked the host and it is pingable/sshable.

vle@host-6-245 ~ $ sudo fping foopy06.build.mtv1.mozilla.com
foopy06.build.mtv1.mozilla.com is alive
vle@host-6-245 ~ $ ssh foopy06.build.mtv1.mozilla.com
The authenticity of host 'foopy06.build.mtv1.mozilla.com (10.250.48.200)' can't be established.
RSA key fingerprint is b6:4f:f7:6b:e6:c4:b6:97:9c:ca:89:a7:70:24:3b:34.
Are you sure you want to continue connecting (yes/no)? 

Let me know if issues persist.

Thanks,
Van
Assignee: server-ops → vle
There are a lot of logs about crash reports and low disk space in /var/log/system.log. Not sure if the disk space issues caused the crashes, but cleaning off some stuff might be worthwhile, releng folks.
The disk space usage on the machine seemed way out of whack.  df showed all but 2G used, but du didn't agree.  bhearsum tried to reboot it and it's pingable but ssh doesn't respond.  Can we run some diags on it to see if the hardware is toast?
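
For anyone repeating the df/du comparison, here is a minimal sketch. The function name usage_gap_kb is made up for illustration, and it assumes a POSIX shell with df, du, and awk available; it is not necessarily the command that was run on foopy06.

```shell
#!/bin/sh
# Compare what the filesystem reports as used (df) with what can be found
# by walking the directory tree (du). A large gap can point at files that
# were deleted while still open, at directories du could not read, or at
# filesystem corruption.

# usage_gap_kb MOUNTPOINT -> prints "df_used_kb du_used_kb gap_kb"
usage_gap_kb() {
    mount_point=$1
    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    df_used=$(df -Pk "$mount_point" | awk 'NR == 2 { print $3 }')
    # -s prints only the total; -x stays on a single filesystem.
    du_used=$(du -skx "$mount_point" 2>/dev/null | awk '{ print $1 }')
    # Bail out rather than do arithmetic on missing values.
    [ -n "$df_used" ] && [ -n "$du_used" ] || return 1
    echo "$df_used $du_used $((df_used - du_used))"
}
```

Running it as root against the root filesystem (usage_gap_kb /) lets du read root-only directories, which otherwise show up as a false gap.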

Comment 5

6 years ago
Desktop ran hardware diagnostics and everything checked out OK. Do you want to try to reimage the host or should we decommission it?

Thanks,
Van
(Reporter)

Comment 6

6 years ago
(In reply to Van Le [:van] from comment #5)
> Desktop ran hardware diagnostics and everything checked out OK. Do you want
> to try to reimage the host or should we decommission it?
> 

If our only choices are reimage or decomm, I vote decomm. The work required to set it back up from scratch is not worth it.

After the hardware diagnostics, does what arr said in comment 4 still apply?

(In reply to Amy Rich [:arich] [:arr] from comment #4)
> The disk space usage on the machine seemed way out of whack.  df showed all
> but 2G used, but du didn't agree.  bhearsum tried to reboot it and it's
> pingable but ssh doesn't respond.
Well, if there are no hardware issues, I can take another crack at it once it's back up again.
I ran a verify on the disk and it found corruption.  Could someone please boot up off of a recovery disk and run repair on the disk?

Comment 9

6 years ago
I ran a repair disk on foopy06 and it fixed the hard drive errors. Ran the verify afterwards and no errors were detected. Machine is now plugged in at 2.IDF location. Please verify that you can access it.

-Vinh
(Reporter)

Comment 10

6 years ago
Looks up to me. I'm running ./start_cp.sh on it (after clearing the stale .pid files).
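
For reference, a rough sketch of how stale .pid files could be cleared without touching ones that belong to live processes. The function names and the directory argument are hypothetical; the actual location of the .pid files on foopy06 is not recorded in this bug.

```shell
#!/bin/sh
# is_stale_pidfile FILE -> exit 0 if FILE holds a PID that is not running.
# Run this as the user that owns the processes: kill -0 also fails for a
# live process owned by someone else, which would look stale here.
is_stale_pidfile() {
    pid=$(cat "$1" 2>/dev/null) || return 1
    # Treat empty or non-numeric content as "not stale" rather than guess.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    # kill -0 sends no signal; it only checks that the process exists.
    ! kill -0 "$pid" 2>/dev/null
}

# remove_stale_pidfiles DIR -> delete only the *.pid files whose process is gone.
remove_stale_pidfiles() {
    for f in "$1"/*.pid; do
        [ -e "$f" ] || continue
        if is_stale_pidfile "$f"; then
            echo "removing stale $f"
            rm -f "$f"
        fi
    done
}
```

A PID can be reused by an unrelated process after a reboot, so a file that passes the liveness check is not guaranteed to belong to the original daemon.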

Will reopen if it acts up in the next few hours, or file a new bug if it acts up in a few weeks down the road.

Thanks
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
With the disk corruption fixed, I was able to see that the host was now showing 100% disk usage and that it was coming from /.Spotlight-V100.

It looks like whoever set up this host did not disable indexing, so it ate 56G of space indexing who knows what.  I've turned off indexing (mdutil -a -i off) and we're now back down to 26% used.

As a preventative measure, I also ran a for loop across all other foopies to disable spotlight on them as well.
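
A sketch of what such a loop might look like, written as a dry run that only prints the commands. The host names and the function name disable_spotlight_cmds are placeholders, and the use of ssh with sudo is an assumption about how the foopies are reached; only the mdutil invocation comes from this bug.

```shell
#!/bin/sh
# disable_spotlight_cmds HOST... -> prints one command per host (dry run).
disable_spotlight_cmds() {
    for host in "$@"; do
        # mdutil -a -i off turns Spotlight indexing off on all volumes.
        printf 'ssh %s sudo mdutil -a -i off\n' "$host"
    done
}

# Placeholder hosts; substitute the real foopy inventory.
disable_spotlight_cmds foopy07.build.mtv1.mozilla.com foopy08.build.mtv1.mozilla.com
```

Printing first makes it easy to review the list before piping the output to sh. Afterwards, mdutil -a -s on each host reports whether indexing is still enabled.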
Assignee: vle → arich
Component: Server Operations: DCOps → Server Operations: RelEng
QA Contact: dmoore → arich
(Reporter)

Comment 12

6 years ago
Since this bug has a lot of context, I'm reopening rather than filing anew:

[22:27:41]	nagios-releng	Thu 19:28:16 PST [402] foopy06.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Can we get insight as to what happened, disk corruption again/etc?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: arich → server-ops-dcops
Component: Server Operations: RelEng → Server Operations: DCOps
QA Contact: arich → dmoore
Power light was off.
It's up now:

LovelyMacBookAir:~ dumitrugherman$ ping -q -c 5 foopy06.build.mtv1.mozilla.com 
PING foopy06.build.mtv1.mozilla.com (10.250.48.200): 56 data bytes

--- foopy06.build.mtv1.mozilla.com ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.485/2.375/3.399/0.815 ms

LovelyMacBookAir:~ dumitrugherman$ c foopy06.build.mtv1.mozilla.com
The authenticity of host 'foopy06.build.mtv1.mozilla.com (10.250.48.200)' can't be established.
RSA key fingerprint is b6:4f:f7:6b:e6:c4:b6:97:9c:ca:89:a7:70:24:3b:34.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'foopy06.build.mtv1.mozilla.com,10.250.48.200' (RSA) to the list of known hosts.
Password:
Assignee: server-ops-dcops → dgherman
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
This outage is unrelated to the reason this bug was initially opened. That issue was fixed when this bug was closed.

I checked the logs and there was nothing obvious.  

I ran Disk Utility, and both need to be repaired, but that's quite possibly a result of disk corruption during an unplanned shutdown, not a symptom of the reason it shut down.

Please boot it from CD, run Disk Utility to fix them, and then run a hardware diagnostic to see if there are any other issues.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: dgherman → server-ops-dcops

Comment 15

6 years ago
Hardware diagnostics did not detect any hardware issues except for some corrupted files.  Also ran disk utility from boot up to fix disk issues.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations