Closed Bug 1094087 Opened 11 years ago Closed 11 years ago

LSI Raid on backup1.private.par1.mozilla.com is WARNING: WARNING: 0:0:RAID-6:16 drives:1.818TB:Optimal Drives:16 online(23 Errors)

Categories

(Infrastructure & Operations :: MOC: Problems, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nagiosapi, Unassigned)

Details

(Whiteboard: [id=nagios1.private.scl3.mozilla.com:460996])

Automated alert report from nagios1.private.scl3.mozilla.com:
Hostname: backup1.private.par1.mozilla.com
Service: LSI Raid
State: WARNING
Output: WARNING: 0:0:RAID-6:16 drives:1.818TB:Optimal Drives:16 online(23 Errors)
Runbook: http://m.allizom.org/LSI+Raid
Automated alert acknowledgement: (ashish)rebuilding drive bug 1094087
Status: NEW → ASSIGNED
Took drive offline and initiated a rebuild:
> Rebuild Progress on Device at Enclosure 17, Slot 1 Completed 0% in 2 Minutes.
Once rebuild completes, run the following command to put the drive online again:
> # cd /opt/MegaRAID/MegaCli
> # ./MegaCli64 -PDOnline -PhysDrv [17:1] -a0
Ashish, the server is beeping all the time. Since there is nothing on it, I just took it offline. Please let me know when we can work together on it. Thanks!
The machine has been downtimed until later in the day (when PAR staff are not in the office). The rebuild was interrupted, so it needs to be done again.
Rick, can you take a look at this on your shift? The PAR staff should be out, and I've asked Guiom to switch it on.
Flags: needinfo?(rbryce)
Hello Guillaume, can you turn on the server again? The server will continue to beep until the RAID finishes rebuilding. We could perhaps silence the alarm if needed.
The server is back online:
$ uptime
17:51:40 up 56 min, 1 user, load average: 0.00, 0.01, 0.00
$ cd /opt/MegaRAID/MegaCli
$ sudo ./MegaCli64 -AdpSetProp AlarmSilence -a0 # this silences the beeping
# Rebuild status:
$ sudo ./MegaCli64 -PDRbld -ShowProg -PhysDrv [17:1] -a0
Rebuild Progress on Device at Enclosure 17, Slot 1 Completed 91% in 311 Minutes.
Exit Code: 0x00
The rebuild may be stuck, so I will check in again (~1-2 hrs) for any progress. We may need to "remove" the drive and restart the rebuild; it is possible that turning off the server interrupted the rebuild and left it in an irrecoverable state.
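The "is it stuck?" check above can be sketched as a comparison of two progress readings taken some time apart. This is a minimal sketch, assuming the progress line always has the form quoted in this bug; `rebuild_percent` and `rebuild_state` are hypothetical helper names, not part of MegaCli, and an unchanged percentage is only a hint of a stall, not proof.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text and never call MegaCli themselves.

# Print the percentage from a progress line such as:
#   Rebuild Progress on Device at Enclosure 17, Slot 1 Completed 91% in 311 Minutes.
# Prints nothing if the line does not match that (assumed) format.
rebuild_percent() {
    printf '%s\n' "$1" |
        sed -n 's/.*Completed \([0-9][0-9]*\)% in [0-9][0-9]* Minutes.*/\1/p'
}

# Compare an earlier and a later progress line.
# Prints "progressing", "stalled", or "unknown" (if either line is unparseable).
rebuild_state() {
    before=$(rebuild_percent "$1")
    after=$(rebuild_percent "$2")
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo unknown
    elif [ "$after" -gt "$before" ]; then
        echo progressing
    else
        echo stalled
    fi
}
```

In practice the two arguments would be the progress lines captured from the `-PDRbld -ShowProg` command shown above, for example once now and once an hour or two later.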
Flags: needinfo?(rbryce)
Automated alert acknowledgement: (rbryce)WIP
Automated alert recovery:
Hostname: backup1.private.par1.mozilla.com
Service: LSI Raid
State: OK
Output: OK: 0:0:RAID-6:16 drives:1.818TB:Optimal Drives:16 online
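The WARNING and OK outputs in this bug differ only in the state prefix and the trailing "(23 Errors)" suffix. As a rough sketch, the error count could be pulled out of the check output as follows. The format is inferred from the two samples in this bug only, and `raid_error_count` is a hypothetical helper, not part of Nagios or the check plugin.

```shell
#!/bin/sh
# Hypothetical helper: print the error count from an LSI Raid check output line.
# Assumes errors appear as a trailing "(N Errors)"; prints 0 when that is absent.
raid_error_count() {
    count=$(printf '%s\n' "$1" |
        sed -n 's/.*(\([0-9][0-9]*\) Errors).*/\1/p')
    echo "${count:-0}"
}

raid_error_count 'WARNING: 0:0:RAID-6:16 drives:1.818TB:Optimal Drives:16 online(23 Errors)'
raid_error_count 'OK: 0:0:RAID-6:16 drives:1.818TB:Optimal Drives:16 online'
```

The first call uses the WARNING output from the original alert and the second uses the recovery output.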
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This recurred, and repeating the rebuild did not clear the errors. I've filed Bug 1096415 for replacing the drive.
Ashish, I was hoping you could help with the bleeping. The Paris folks are complaining that they cannot use the community space nearby.
Flags: needinfo?(ashish)
Alarm was silenced.
[12:22] < guiom> | Aj awesome the bleeping stopped.
Flags: needinfo?(ashish)
The following command, which has been executed, should permanently disable ALL alarms (beeping):
$ sudo ./MegaCli64 -AdpSetProp AlarmDsbl -aALL
Adapter 0: Set alarm to Disabled success.
Exit Code: 0x00
Component: MOC: Incidents → MOC: Problems