Decom node5.testing.stage.metrics.scl3.mozilla.com

RESOLVED FIXED

People

(Reporter: nagiosapi, Unassigned)

Tracking

({spring-cleaning})

Description

4 years ago
Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: node5.testing.stage.metrics.scl3.mozilla.com
Service:  Metrics Disk
State:    CRITICAL
Output:   NRPE: Unable to read output

Runbook:  http://m.allizom.org/Metrics+Disk

Comment 1

4 years ago
I can't SSH to the box. I logged into this Supermicro box via the OOB console, and it looks like we have a disk failure.

Not sure if it can be salvaged.
DCOps, can you check out a disk failure on this host, please?
Assignee: nobody → server-ops-dcops
Component: Server Operations: MOC → Server Operations: DCOps

Updated

4 years ago
colo-trip: --- → scl3

Comment 3

4 years ago
I can't figure out the root password for this box, which needs a manual fsck. Can someone PM me or point me in the right direction? I've tried the root passwords in the sysadmin GPG file with no luck.

Comment 4

4 years ago
:tmary, is this server still in use? All the root passwords we tried didn't work, and the server is no longer booting up properly. Per SA, we can try to reimage it or decommission it, as it is no longer under warranty.
Flags: needinfo?(tmeyarivan)

Comment 5

4 years ago
(In reply to Van Le [:van] from comment #4)
> :tmary, is this server still in use? All the root passwords we tried didn't
> work, and the server is no longer booting up properly. Per SA, we can try to
> reimage it or decommission it, as it is no longer under warranty.

Host should be reimaged if required

If it needs disk replacements, can the existing disks from node6.testing.stage be used here?
(node6 has been decommissioned.)
Flags: needinfo?(tmeyarivan)

Comment 6

4 years ago
>Host should be reimaged if required

All 4 drives are detected, with no errors reported by the SATA controller. Over to MOC to kick the host. Per :rbryce, since none of the passwords worked, this host might not have been configured properly.

Please let me know if you need further hands-on help.
Assignee: server-ops-dcops → server-ops
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → shyam
node5.testing.stage.metrics.scl3.mozilla.com has been down for almost 2 weeks (see PING below) and probably not very functional for 3 months...

Service       Status    Last Check           Duration         Attempt  Status Information
IPMI Log      CRITICAL  11-03-2014 08:22:33  92d 14h 43m 55s  3/3      CHECK_NRPE: Socket timeout after 30 seconds.
Metrics Disk  CRITICAL  11-03-2014 08:22:58  92d 15h 0m 59s   3/3      CHECK_NRPE: Socket timeout after 15 seconds.
PING          CRITICAL  11-03-2014 08:26:52  13d 9h 7m 2s     3/3      PING CRITICAL - Packet loss = 100%
Swap          CRITICAL  11-03-2014 08:26:37  92d 15h 1m 33s   3/3      CHECK_NRPE: Socket timeout after 15 seconds.
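
For anyone wanting to double-check the "almost 2 weeks" and "3 months" figures, the duration values in the Nagios status lines above can be converted with a few lines of Python. This is only a rough sketch: the helper name `parse_duration` is made up for illustration, and it assumes durations always use the `Nd Nh Nm Ns` layout seen in this bug.

```python
import re
from datetime import timedelta

# Matches durations such as "92d 14h 43m 55s"; every unit is optional.
_DURATION = re.compile(
    r"(?:(\d+)d)?\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*(?:(\d+)s)?"
)


def parse_duration(text: str) -> timedelta:
    """Convert a Nagios-style duration string into a timedelta.

    Hypothetical helper, not part of Nagios itself.
    """
    match = _DURATION.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    ping_down = parse_duration("13d 9h 7m 2s")
    disk_down = parse_duration("92d 15h 0m 59s")
    # PING: roughly 1.9 weeks, i.e. "almost 2 weeks".
    print(f"PING down for {ping_down / timedelta(weeks=1):.1f} weeks")
    # Metrics Disk: 92 days, i.e. roughly 3 months.
    print(f"Metrics Disk failing for {disk_down.days} days")
```
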
According to tmary in last week's data team meeting, this server can be decom'd. Changing the summary to reflect that.
Summary: Metrics Disk on node5.testing.stage.metrics.scl3.mozilla.com is CRITICAL: NRPE: Unable to read output → Decom node5.testing.stage.metrics.scl3.mozilla.com
Blocks: 1096344
Keywords: spring-cleaning
Whiteboard: [id=nagios1.private.scl3.mozilla.com:395039]

Comment 9

4 years ago
10.22.31.212 = node5.testing.stage.metrics.scl3
No NFS because historically it hasn't used any. Assuming no netvault.
Pulled from nagios in change 99007.
Already powered off because of damage. Waiting a week is quite possibly silly, but this is an easy Friday decom compared with some of the more thought-requiring ones I'm doing, so I'm going through the motions of waiting by throwing it back on the pile for a while.
Group: mozilla-employee-confidential

Updated

4 years ago
colo-trip: scl3 → ---

Updated

4 years ago
Group: mozilla-employee-confidential
Component: Server Operations → MOC: Service Requests
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → lypulong

Comment 10

4 years ago
Punting over to DCOps for physical decom.
Component: MOC: Service Requests → DCOps
QA Contact: lypulong

Updated

4 years ago
colo-trip: --- → scl3

Comment 11

4 years ago
Host has been decomm'd; inventory and DNS have been updated.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED