Automated alert report from nagios1.private.scl3.mozilla.com:
Hostname: node5.testing.stage.metrics.scl3.mozilla.com
Service: Metrics Disk
State: CRITICAL
Output: NRPE: Unable to read output
Runbook: http://m.allizom.org/Metrics+Disk
I can't ssh to the box. I logged into this supermicro box via the OOB and it looks like we have a disk failure? Not sure if it can be salvaged?
DCOps, can you check out a disk failure on this host, please?
Assignee: nobody → server-ops-dcops
Component: Server Operations: MOC → Server Operations: DCOps
Can't figure out the root password for this box, and it needs a manual fsck. Can someone PM me or point me in the right direction? I've tried the root passwords in the sysadmin gpg file with no luck.
:tmary, is this server still in use? All the root passwords we tried didn't work and the server is no longer booting up properly. Per SA, we can try to reimage it or decommission it, as it is no longer under warranty.
(In reply to Van Le [:van] from comment #4)
> :tmary, is this server still in use? All the root passwords we tried didn't
> work and the server is no longer booting up properly. Per SA, we can try to
> reimage it or decommission it, as it is no longer under warranty.

Host should be reimaged if required.
If it needs disk replacements, can the existing disks from node6.testing.stage be used here? (node6 has been decommissioned.)
> Host should be reimaged if required

All 4 drives are detected with no errors reported by the SATA controller. Over to MOC to kick the host. Per :rbryce, since none of the passwords worked, this host might not have been configured properly. Please let me know if you need further hands-on.
Assignee: server-ops-dcops → server-ops
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → shyam
node5.testing.stage.metrics.scl3.mozilla.com has been down for almost 2 weeks (see PING below) and probably not very functional for 3 months...

IPMI Log - CRITICAL 11-03-2014 08:22:33 92d 14h 43m 55s 3/3 CHECK_NRPE: Socket timeout after 30 seconds.
Metrics Disk - CRITICAL 11-03-2014 08:22:58 92d 15h 0m 59s 3/3 CHECK_NRPE: Socket timeout after 15 seconds.
PING - CRITICAL 11-03-2014 08:26:52 13d 9h 7m 2s 3/3 PING CRITICAL - Packet loss = 100%
Swap - CRITICAL 11-03-2014 08:26:37 92d 15h 1m 33s 3/3 CHECK_NRPE: Socket timeout after 15 seconds.
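For anyone who wants to sanity-check the "almost 2 weeks" and "3 months" figures, here is a minimal sketch that converts the duration strings from the status lines above into weeks. The helper name `parse_duration` and the `durations` dictionary are hypothetical, written just for this illustration; the values are copied from the Nagios output in this comment, and the `Nd Nh Nm Ns` format is assumed from those four lines only.

```python
import re
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '92d 14h 43m 55s' into a timedelta.

    Hypothetical helper: the format is assumed from the status lines above.
    """
    match = re.fullmatch(r"(\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


# Durations copied from the status lines in this comment.
durations = {
    "IPMI Log": "92d 14h 43m 55s",
    "Metrics Disk": "92d 15h 0m 59s",
    "PING": "13d 9h 7m 2s",
    "Swap": "92d 15h 1m 33s",
}

for service, text in durations.items():
    # Dividing one timedelta by another yields a plain float.
    weeks = parse_duration(text) / timedelta(weeks=1)
    print(f"{service}: {weeks:.1f} weeks")
```

The PING duration comes out just under 2 weeks and the NRPE-based checks at a little over 13 weeks, which matches the summary above.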
According to tmary in last week's data team meeting, this server can be decom'd. Changing the subject to reflect.
Summary: Metrics Disk on node5.testing.stage.metrics.scl3.mozilla.com is CRITICAL: NRPE: Unable to read output → Decom node5.testing.stage.metrics.scl3.mozilla.com
10.22.31.212 = node5.testing.stage.metrics.scl3
No NFS because historically it hasn't used any. Assuming no netvault. Pulled from nagios in change 99007. Already powered off because of damage.

Waiting a week is quite possibly silly, but this is also an easy Friday decom as opposed to some of the more thought-requiring ones I'm doing, so, going through the motions of waiting by throwing it back on the pile for a while.
Component: Server Operations → MOC: Service Requests
Product: mozilla.org → Infrastructure & Operations
Punting over to DCOps for physical decom.
Component: MOC: Service Requests → DCOps
QA Contact: lypulong
Host has been decomm'd, inventory and DNS updated.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED