Closed Bug 1047936 Opened 10 years ago Closed 9 years ago

Decom node5.testing.stage.metrics.scl3.mozilla.com

Categories

(Infrastructure & Operations :: DCOps, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nagiosapi, Unassigned)


Details

(Keywords: spring-cleaning)

Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: node5.testing.stage.metrics.scl3.mozilla.com
Service:  Metrics Disk
State:    CRITICAL
Output:   NRPE: Unable to read output

Runbook:  http://m.allizom.org/Metrics+Disk
I can't SSH to the box. I logged into this Supermicro box via the OOB, and it looks like we have a disk failure.

Not sure whether it can be salvaged.
DCOps, can you check out a disk failure on this host, please?
Assignee: nobody → server-ops-dcops
Component: Server Operations: MOC → Server Operations: DCOps
colo-trip: --- → scl3
Can't figure out the root password for this box, which needs a manual fsck. Can someone PM me or point me in the right direction? I've tried the root passwords in the sysadmin GPG file with no luck.
:tmary, is this server still in use? All the root passwords we tried didn't work, and the server is no longer booting up properly. Per SA, we can try to reimage it or decommission it, as it is no longer under warranty.
Flags: needinfo?(tmeyarivan)
(In reply to Van Le [:van] from comment #4)
> :tmary, is this server still in use? All the root passwords we tried didn't
> work, and the server is no longer booting up properly. Per SA, we can try to
> reimage it or decommission it, as it is no longer under warranty.

Host should be reimaged if required.

If it needs disk replacements, can the existing disks on node6.testing.stage be used here?
(node6 has been decommissioned.)

--
Flags: needinfo?(tmeyarivan)
> Host should be reimaged if required

All 4 drives are detected, with no errors reported by the SATA controller. Over to MOC to kick the host. Per :rbryce, since none of the passwords worked, this host might not have been configured properly.

Please let me know if you need further hands-on.
Assignee: server-ops-dcops → server-ops
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → shyam
node5.testing.stage.metrics.scl3.mozilla.com has been down for almost 2 weeks (see PING below) and has probably not been very functional for 3 months...
	
Service       Status    Last Check           Duration         Attempt  Status Information
IPMI Log      CRITICAL  11-03-2014 08:22:33  92d 14h 43m 55s  3/3      CHECK_NRPE: Socket timeout after 30 seconds.
Metrics Disk  CRITICAL  11-03-2014 08:22:58  92d 15h 0m 59s   3/3      CHECK_NRPE: Socket timeout after 15 seconds.
PING          CRITICAL  11-03-2014 08:26:52  13d 9h 7m 2s     3/3      PING CRITICAL - Packet loss = 100%
Swap          CRITICAL  11-03-2014 08:26:37  92d 15h 1m 33s   3/3      CHECK_NRPE: Socket timeout after 15 seconds.
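
The rough figures above (almost 2 weeks without PING, about 3 months of failing NRPE checks) can be checked against the durations in the Nagios output. A minimal Python sketch, assuming the `Nd Nh Nm Ns` duration format shown there; the function name `duration_to_days` and the `DURATIONS` table are made up for illustration and are not part of any Nagios tooling:

```python
import re

# Duration strings copied from the Nagios output in this comment.
DURATIONS = {
    "IPMI Log": "92d 14h 43m 55s",
    "Metrics Disk": "92d 15h 0m 59s",
    "PING": "13d 9h 7m 2s",
    "Swap": "92d 15h 1m 33s",
}

# Assumed format: days, hours, minutes, seconds, separated by whitespace.
_DURATION_RE = re.compile(r"(\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s")


def duration_to_days(text):
    """Convert a duration such as '13d 9h 7m 2s' to fractional days."""
    match = _DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return days + hours / 24 + minutes / 1440 + seconds / 86400


if __name__ == "__main__":
    for service, text in DURATIONS.items():
        days = duration_to_days(text)
        # Weeks are shown only to compare with the "almost 2 weeks" estimate.
        print(f"{service}: {days:.1f} days ({days / 7:.1f} weeks)")
```

The NRPE-based checks all come out at a little over 92 days and PING at a little over 13 days, which matches the "3 months" and "almost 2 weeks" estimates.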
According to tmary in last week's data team meeting, this server can be decom'd. Changing the subject to reflect.
Summary: Metrics Disk on node5.testing.stage.metrics.scl3.mozilla.com is CRITICAL: NRPE: Unable to read output → Decom node5.testing.stage.metrics.scl3.mozilla.com
Keywords: spring-cleaning
Whiteboard: [id=nagios1.private.scl3.mozilla.com:395039]
10.22.31.212 = node5.testing.stage.metrics.scl3
No NFS because historically it hasn't used any.  Assuming no netvault.
Pulled from nagios in change 99007.
Already powered off because of damage.  Waiting a week is quite possibly silly, but this is also an easy Friday decom as opposed to some of the more thought-requiring ones I'm doing, so, going through the motions of waiting by throwing it back on the pile for a while.
Group: mozilla-employee-confidential
colo-trip: scl3 → ---
Group: mozilla-employee-confidential
Component: Server Operations → MOC: Service Requests
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → lypulong
Punting over to DCOps for physical decom.
Component: MOC: Service Requests → DCOps
QA Contact: lypulong
colo-trip: --- → scl3
Host has been decomm'd, inventory and DNS updated.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED