Closed Bug 1047936 Opened 10 years ago Closed 9 years ago

Decom node5.testing.stage.metrics.scl3.mozilla.com

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: nagiosapi, Unassigned)


Details

(Keywords: spring-cleaning)

MOC Nagios API

Reporter

Description

•

10 years ago

Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: node5.testing.stage.metrics.scl3.mozilla.com
Service:  Metrics Disk
State:    CRITICAL
Output:   NRPE: Unable to read output

Runbook:  http://m.allizom.org/Metrics+Disk
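For anyone triaging reports like this in bulk, the body above is a plain "Key: value" layout. A minimal sketch of splitting it into fields follows; `parse_alert` is a hypothetical helper written for illustration, not part of Nagios or the MOC Nagios API, and it assumes every field name is a single word followed by a colon.

```python
def parse_alert(report: str) -> dict[str, str]:
    """Split the 'Key: value' lines of an alert report into a dict (illustrative helper)."""
    fields = {}
    for line in report.splitlines():
        # Split on the first colon only, so values such as 'NRPE: ...' stay whole.
        key, sep, value = line.partition(":")
        # Accept only single-word keys like 'Hostname'; this skips the banner line.
        if sep and key.strip().isalpha():
            fields[key.strip()] = value.strip()
    return fields


report = """\
Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: node5.testing.stage.metrics.scl3.mozilla.com
Service:  Metrics Disk
State:    CRITICAL
Output:   NRPE: Unable to read output

Runbook:  http://m.allizom.org/Metrics+Disk
"""

alert = parse_alert(report)
print(alert["Hostname"], alert["Service"], alert["State"])
```

Splitting on the first colon keeps the Output and Runbook values intact even though they contain colons themselves.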

david garvey:dgarvey

Comment 1

•

10 years ago

I can't ssh to the box. I logged into this supermicro box via the OOB and it looks like we have a disk failure?

Not sure if it can be salvaged?

Peter Radcliffe [:pir]

Comment 2

•

10 years ago

DCOps, can you check out a disk failure on this host, please?

Assignee: nobody → server-ops-dcops

Component: Server Operations: MOC → Server Operations: DCOps

Van Le [:van]

Updated

•

10 years ago

colo-trip: --- → scl3

Van Le [:van]

Comment 3

•

10 years ago

can't figure out the root password for this box, which needs a manual fsck. can someone pm me or point me in the right direction? I've tried the root passwords in the sysadmin gpg file with no luck.

Van Le [:van]

Comment 4

•

10 years ago

:tmary, is this server still in use? all the root passwords we tried didn't work and the server is no longer booting up properly. per SA, we can try to reimage it or decommission it as it is no longer under warranty.

Flags: needinfo?(tmeyarivan)

T [:tmary] Meyarivan

Comment 5

•

10 years ago

(In reply to Van Le [:van] from comment #4)
> :tmary, is this server still in use? all the root passwords we tried didn't
> work and the server is no longer booting up properly. per SA, we can try to
> reimage it or decommission it as it is no longer under warranty.

Host should be reimaged if required

If it needs disk replacements, can the existing disks from node6.testing.stage be used here?
(node6 has been decommissioned)

--

Flags: needinfo?(tmeyarivan)

Van Le [:van]

Comment 6

•

10 years ago

>Host should be reimaged if required

all 4 drives are detected with no errors reported by the sata controller. over to MOC to kick the host. per :rbryce, since none of the passwords worked, this host might not have been configured properly.

please let me know if you need further hands on.

Assignee: server-ops-dcops → server-ops

Component: Server Operations: DCOps → Server Operations

QA Contact: dmoore → shyam

Sheeri Cabral [:sheeri]

Comment 7

•

10 years ago

node5.testing.stage.metrics.scl3.mozilla.com has been down for almost 2 weeks (see PING below) and probably not very functional for 3 months...

IPMI Log - CRITICAL      11-03-2014 08:22:33   92d 14h 43m 55s   3/3   CHECK_NRPE: Socket timeout after 30 seconds.
Metrics Disk - CRITICAL  11-03-2014 08:22:58   92d 15h 0m 59s    3/3   CHECK_NRPE: Socket timeout after 15 seconds.
PING - CRITICAL          11-03-2014 08:26:52   13d 9h 7m 2s      3/3   PING CRITICAL - Packet loss = 100%
Swap - CRITICAL          11-03-2014 08:26:37   92d 15h 1m 33s    3/3   CHECK_NRPE: Socket timeout after 15 seconds.
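As a quick check on the "almost 2 weeks" and "3 months" figures, the duration column above can be converted to days. This is a minimal sketch; `parse_duration` is a hypothetical helper, and it assumes the durations always follow the "Nd Nh Nm Ns" layout shown in the pasted status rows.

```python
import re
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Convert a duration such as '13d 9h 7m 2s' to a timedelta (illustrative helper)."""
    match = re.fullmatch(r"(\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Durations copied from the status rows in this comment.
durations = {
    "IPMI Log": "92d 14h 43m 55s",
    "Metrics Disk": "92d 15h 0m 59s",
    "PING": "13d 9h 7m 2s",
    "Swap": "92d 15h 1m 33s",
}

for service, text in durations.items():
    # Dividing two timedeltas yields a float number of days.
    days = parse_duration(text) / timedelta(days=1)
    print(f"{service}: {days:.1f} days")
```

The PING outage works out to roughly 13.4 days and the NRPE checks to roughly 92.6 days, which lines up with the estimates above.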

Sheeri Cabral [:sheeri]

Comment 8

•

10 years ago

According to tmary in last week's data team meeting, this server can be decom'd. Changing the subject to reflect.

Summary: Metrics Disk on node5.testing.stage.metrics.scl3.mozilla.com is CRITICAL: NRPE: Unable to read output → Decom node5.testing.stage.metrics.scl3.mozilla.com

Ludovic Hirlimann [:Usul]

Updated

•

9 years ago

Keywords: spring-cleaning

Whiteboard: [id=nagios1.private.scl3.mozilla.com:395039]

Greg Cox [:gcox]

Comment 9

•

9 years ago

10.22.31.212 = node5.testing.stage.metrics.scl3
No NFS because historically it hasn't used any.  Assuming no netvault.
Pulled from nagios in change 99007.
Already powered off because of damage.  Waiting a week is quite possibly silly, but this is also an easy Friday decom as opposed to some of the more thought-requiring ones I'm doing, so, going through the motions of waiting by throwing it back on the pile for a while.

Group: mozilla-employee-confidential

Greg Cox [:gcox]

Updated

•

9 years ago

colo-trip: scl3 → ---

Peter Radcliffe [:pir]

Updated

•

9 years ago

Group: mozilla-employee-confidential

Component: Server Operations → MOC: Service Requests

Product: mozilla.org → Infrastructure & Operations

Ashish Vijayaram [:ashish]

Updated

•

9 years ago

QA Contact: shyam → lypulong

Vinh Hua [:vinh]

Comment 10

•

9 years ago

Punting over to DCOps for physical decom.

Component: MOC: Service Requests → DCOps

QA Contact: lypulong

Vinh Hua [:vinh]

Updated

•

9 years ago

colo-trip: --- → scl3

Vinh Hua [:vinh]

Comment 11

•

9 years ago

Host has been decomm'd, inventory and DNS updated.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED


Categories

(Infrastructure & Operations :: DCOps, task)


Security

(public)
