Closed Bug 923388 Opened 11 years ago Closed 11 years ago

db1.iddb PHX1 experiencing repeated reboots

Categories

(Cloud Services :: Operations: Miscellaneous, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gene, Assigned: gene)

Please examine the system and hardware of the db1.iddb server in PHX1. It has experienced 2 reboots in the last week, taking down Persona.
Blocks: 921395
Now 3 reboots
Filed HP ticket: 4646274132

Looks like a faulty memory module, logged in the IML log:

22:50:48 <@nagios-svc-phx1> Fri 22:50:48 PDT [199] db1.iddb.phx1.svc:hplog is CRITICAL: CRITICAL 0001: Uncorrectable Memory Error ((Processor 2, Memory Module 2))
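
For anyone who wants to pull the failing component out of alerts like this one, here is a rough sketch of a parser. The regular expression is modeled only on the single line quoted above; the function name and the assumption that other hplog alerts share this layout are mine, not something confirmed by the check itself.

```python
import re

# Pattern modeled on the one alert line quoted in this bug (assumption:
# other hplog memory alerts use the same layout).
ALERT_RE = re.compile(
    r"(?P<host>\S+):(?P<check>\w+) is (?P<state>[A-Z]+): "
    r".*?: (?P<message>.+?) "
    r"\(\(Processor (?P<processor>\d+), Memory Module (?P<module>\d+)\)\)"
)


def parse_memory_alert(line):
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    fields["processor"] = int(fields["processor"])
    fields["module"] = int(fields["module"])
    return fields


line = (
    "22:50:48 <@nagios-svc-phx1> Fri 22:50:48 PDT [199] "
    "db1.iddb.phx1.svc:hplog is CRITICAL: CRITICAL 0001: "
    "Uncorrectable Memory Error ((Processor 2, Memory Module 2))"
)
alert = parse_memory_alert(line)
print(alert)
```

A line that does not follow this layout returns None, so callers should treat a miss as "unparsed" rather than "no error".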

I'm having an HP tech sent to PHX1 to replace the module, but that may require a small amount of downtime. Gene, would it be fine to do what you discussed yesterday and promote a healthy db blade (db2.iddb.phx1)? It is hard to predict when the next crash will happen, but since the crashes are becoming more frequent, it would probably be safer to shut this blade down entirely until it is serviced.
I spoke with jlaz as well as callahad and lloyd and we're going to have the tech swap out the bad memory at 01:00 AM PDT.

Here is the change plan that I'd like to have someone look over and r+:

https://mana.mozilla.org/wiki/display/SVCOPS/CW-20131006
Assignee: nobody → gene
I've re-established slave replication from master db1.iddb.phx1 to slave db2.iddb.phx1.
I've confirmed that the log-bin and log-slave-updates settings are correct, so db2 can be promoted to master if we need to.
I've spoken on the phone with the HP tech, David Kimball, and given him my phone number. He's going to call me just before the maintenance so I can shut down the server.
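
As an illustration of the settings check mentioned above, here is a minimal sketch that looks for log-bin and log-slave-updates in the [mysqld] section of a MySQL option file. The sample file contents and the function name are hypothetical, not taken from db2. It only tests that the options are present in the text; it does not inspect their values or the running server.

```python
import configparser

# Options a slave needs so it can later act as a master.
# Names are stored with underscores for comparison.
REQUIRED = ("log_bin", "log_slave_updates")


def missing_promotion_settings(cnf_text):
    """Return the required options absent from the [mysqld] section."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # options like "log-slave-updates" may have no value
        strict=False,          # tolerate repeated options
        interpolation=None,    # treat "%" literally
    )
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return list(REQUIRED)
    # MySQL accepts dashes and underscores interchangeably in option names.
    present = {name.replace("-", "_") for name in parser.options("mysqld")}
    return [name for name in REQUIRED if name not in present]


# Hypothetical option file fragment, for illustration only.
sample = """
[mysqld]
server-id = 2
log-bin = mysql-bin
log-slave-updates
"""

print(missing_promotion_settings(sample))
```

An empty list means both options appear in the file; anything listed would need to be added before the slave could be promoted.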
We're delayed a bit. The onsite technician is still locating the server.
Tech found the machine. I shut it down at 01:23 AM PDT. The blade didn't power down on its own. After waiting for a console interface to come up, I finally gave up and had the tech power the blade down at 01:34 AM PDT.

The tech had the memory swapped and the blade back in at 01:36 AM PDT.
At 01:42 AM PDT the blade started responding on the network and replication resumed.
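
For the record, the durations implied by the timestamps above can be worked out with a short sketch. The event labels are my own summary of the comments, and the total is the interval from the shutdown command to the blade responding again, which is not necessarily the same as user-visible Persona downtime.

```python
from datetime import datetime

# Times as reported in this bug (all AM PDT, same night).
events = [
    ("shutdown issued", "01:23"),
    ("blade powered down by tech", "01:34"),
    ("memory swapped, blade back in", "01:36"),
    ("blade responding on network", "01:42"),
]


def minutes_between(start, end):
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Minutes spent reaching each event from the previous one.
steps = [
    (events[i + 1][0], minutes_between(events[i][1], events[i + 1][1]))
    for i in range(len(events) - 1)
]
total = minutes_between(events[0][1], events[-1][1])

for label, minutes in steps:
    print(f"{label}: +{minutes} min")
print(f"total: {total} min")
```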

I've created an account and confirmed that Persona is working.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED