Bug 923388
Opened 11 years ago
Closed 11 years ago
db1.iddb PHX1 experiencing repeated reboots
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gene, Assigned: gene)
Please examine the system and hardware of the db1.iddb server in PHX1. It has experienced 2 reboots in the last week, taking down Persona each time.
Comment 1 (Assignee) • 11 years ago
Now 3 reboots
Comment 2•11 years ago
Filed HP ticket: 4646274132

Looks like a faulty memory module, logged in the IML log:

22:50:48 <@nagios-svc-phx1> Fri 22:50:48 PDT [199] db1.iddb.phx1.svc:hplog is CRITICAL: CRITICAL 0001: Uncorrectable Memory Error ((Processor 2, Memory Module 2))

Having an HP tech sent to PHX1 to replace the module, but that may require a small amount of downtime.

Gene, would it be fine to perform what you discussed yesterday and promote a healthy db blade (db2.iddb.phx1)? It is hard to predict when the next crash will happen, but since this is happening so frequently, it would probably be safer to shut this blade down entirely until it is serviced.
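For anyone triaging a similar alert, the failing slot can be pulled out of the hplog line programmatically. This is only a sketch: the pattern and the function name `parse_memory_alert` are hypothetical and match just the alert shape quoted in this comment, not any documented Nagios or HP log format.

```python
import re

# The alert text quoted in the comment above, copied verbatim.
ALERT = (
    "22:50:48 <@nagios-svc-phx1> Fri 22:50:48 PDT [199] "
    "db1.iddb.phx1.svc:hplog is CRITICAL: CRITICAL 0001: "
    "Uncorrectable Memory Error ((Processor 2, Memory Module 2))"
)

# Hypothetical pattern: written against the single alert seen in this bug.
PATTERN = re.compile(
    r"(?P<host>[\w.-]+):hplog is (?P<state>\w+): .*?"
    r"Uncorrectable Memory Error \(\(Processor (?P<cpu>\d+), "
    r"Memory Module (?P<module>\d+)\)\)"
)


def parse_memory_alert(line):
    """Return host, state, processor and module from an alert line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "host": match["host"],
        "state": match["state"],
        "processor": int(match["cpu"]),
        "module": int(match["module"]),
    }


if __name__ == "__main__":
    print(parse_memory_alert(ALERT))
```

Returning `None` for lines that do not match keeps the helper safe to run over a mixed channel log.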
Comment 3 (Assignee) • 11 years ago
I spoke with jlaz, callahad and lloyd, and we're going to have the tech swap out the bad memory at 01:00 AM PDT. Here is the change plan, which I'd like someone to look over and r+: https://mana.mozilla.org/wiki/display/SVCOPS/CW-20131006
Updated (Assignee) • 11 years ago
Assignee: nobody → gene
Comment 4 (Assignee) • 11 years ago
I've re-established slave replication from master db1.iddb.phx1 to slave db2.iddb.phx1. I've also confirmed that the log-bin and log-slave-updates settings are correct, so db2 can be promoted to master if we need to.
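The settings check described above could be scripted along these lines. This is a sketch under assumptions: it presumes the database is MySQL, that the two options surface as the system variables `log_bin` and `log_slave_updates`, and that their values have already been fetched into a dict (for example from `SHOW VARIABLES`). The function name `promotion_blockers` is hypothetical.

```python
# Variables assumed to be required before a slave can be promoted to master.
REQUIRED = ("log_bin", "log_slave_updates")


def promotion_blockers(variables):
    """Return the names of required settings that are missing or not enabled.

    `variables` maps variable names to their reported values,
    e.g. {"log_bin": "ON", "log_slave_updates": "OFF"}.
    """
    blockers = []
    for name in REQUIRED:
        # A missing variable is treated the same as a disabled one.
        value = str(variables.get(name, "OFF")).strip().upper()
        if value not in ("ON", "1"):
            blockers.append(name)
    return blockers


if __name__ == "__main__":
    reported = {"log_bin": "ON", "log_slave_updates": "ON"}
    print(promotion_blockers(reported) or "ready for promotion")
```

An empty list means both settings are enabled; it says nothing about replication lag, which would need a separate check.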
Comment 5 (Assignee) • 11 years ago
I've spoken on the phone with the HP tech, David Kimball, and given him my phone number. He's going to call me just before the maintenance so I can shut down the server.
Comment 6 (Assignee) • 11 years ago
We're delayed a bit. The onsite technician is still locating the server.
Comment 7 (Assignee) • 11 years ago
Tech found the machine. I shut it down at 01:23 AM PDT. The blade didn't power down on its own. After waiting for a console interface to come up, I finally gave up and had the tech power the blade down at 01:34 AM PDT. The tech had the memory swapped and the blade back in at 01:36 AM PDT. At 01:42 AM PDT the blade started responding on the network and replication began again. I've created an account and confirmed that Persona is working.
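The elapsed time at each step can be worked out from the timestamps above. The sketch below assumes the outage is measured from the shutdown command at 01:23 and that all four times fall on the same night; the event labels and function names are illustrative only.

```python
from datetime import datetime

# Timestamps taken from the comment above (all AM PDT, same night).
EVENTS = [
    ("shutdown issued", "01:23"),
    ("tech powered blade down", "01:34"),
    ("memory swapped, blade reinserted", "01:36"),
    ("blade responding, replication resumed", "01:42"),
]


def minutes_between(start, end):
    """Whole minutes from `start` to `end`, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_from_first(events):
    """Pair each event label with the minutes elapsed since the first event."""
    first = events[0][1]
    return [(label, minutes_between(first, stamp)) for label, stamp in events]


if __name__ == "__main__":
    for label, minutes in elapsed_from_first(EVENTS):
        print(f"+{minutes:2d} min  {label}")
```

On these numbers the blade was unavailable for 19 minutes, 11 of which were spent waiting for the soft shutdown before the manual power-off.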
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED