Closed Bug 923388 Opened 11 years ago Closed 11 years ago

db1.iddb PHX1 experiencing repeated reboots

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: gene, Assigned: gene)

Gene Wood [:gene]

Assignee

Description

•

11 years ago

Please examine the system and hardware of the db1.iddb server in PHX1. It has experienced 2 reboots in the last week, each taking down Persona.

Gene Wood [:gene]

Assignee

Updated

•

11 years ago

Blocks: 921395

Gene Wood [:gene]

Assignee

Comment 1

•

11 years ago

Now 3 reboots

Justin Lazaro [:jlaz] (use needinfo)

Comment 2

•

11 years ago

Filed HP ticket: 4646274132

Looks like a faulty memory module, logged in the IML log:

22:50:48 <@nagios-svc-phx1> Fri 22:50:48 PDT [199] db1.iddb.phx1.svc:hplog is CRITICAL: CRITICAL 0001: Uncorrectable Memory Error ((Processor 2, Memory Module 2))

Having an HP tech sent to PHX1 to replace the module, but that may require a small amount of downtime. Gene, would it be fine to perform what you discussed yesterday and promote a db blade (db2.iddb.phx1) that is healthy? It is hard to predict when the next crash will happen, but since this is happening more frequently, it would probably be safer to shut this blade down entirely until it is serviced.

Gene Wood [:gene]

Assignee

Comment 3

•

11 years ago

I spoke with jlaz as well as callahad and lloyd and we're going to have the tech swap out the bad memory at 01:00 AM PDT.

Here is the change plan that I'd like to have someone look over and r+

https://mana.mozilla.org/wiki/display/SVCOPS/CW-20131006

Gene Wood [:gene]

Assignee

Updated

•

11 years ago

Assignee: nobody → gene

Gene Wood [:gene]

Assignee

Comment 4

•

11 years ago

I've re-established slave replication from master db1.iddb.phx1 to slave db2.iddb.phx1.
I've confirmed that the log-bin and log-slave-updates settings are correct, so db2 can be promoted to master if we need to.
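
For reference, a minimal sketch of the kind of check described above, assuming the settings live in the `[mysqld]` section of a my.cnf-style file. The function name and the sample config text are hypothetical and not taken from the actual db2 host. It only checks that the two options are present, not what they are set to.

```python
import configparser

# Options that must be present on a slave before it can be promoted.
# MySQL accepts dashes or underscores in option names, so normalize to underscores.
REQUIRED = {"log_bin", "log_slave_updates"}

def missing_promotion_options(cnf_text):
    """Return the required options that are absent from the [mysqld] section."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # allows bare flags such as "log-slave-updates"
        strict=False,
        interpolation=None,
    )
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return set(REQUIRED)
    present = {key.replace("-", "_") for key in parser["mysqld"]}
    return REQUIRED - present

# Hypothetical config text for illustration only.
sample = "[mysqld]\nlog-bin = mysql-bin\nlog-slave-updates\n"
print(missing_promotion_options(sample))
```

An empty result means both options were found. On a live server the effective values would still need to be confirmed from the running instance.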

Gene Wood [:gene]

Assignee

Comment 5

•

11 years ago

I've spoken on the phone with the HP tech, David Kimball, and given him my phone number. He's going to call me just before the maintenance so I can shut down the server.

Gene Wood [:gene]

Assignee

Comment 6

•

11 years ago

We're delayed a bit. The onsite technician is still locating the server.

Gene Wood [:gene]

Assignee

Comment 7

•

11 years ago

Tech found the machine. I shut it down at 01:23 AM PDT. The blade didn't power down on its own. After waiting for a console interface to come up, I finally gave up and had the tech power the blade down at 01:34 AM PDT.

The tech had the memory swapped and the blade back in at 01:36 AM PDT.
At 01:42 AM PDT the blade started responding on network and replication began again.

I've created an account and confirmed that Persona is working.
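
As a quick tally of the maintenance window, here is a small sketch that computes the elapsed minutes between the times quoted above. The event labels are made up for illustration, and it assumes all four times fall on the same morning.

```python
from datetime import datetime

FMT = "%I:%M %p"  # 12-hour clock, e.g. "01:23 AM"

# Times (PDT) as reported in this comment; the keys are illustrative labels.
events = {
    "shutdown_requested": "01:23 AM",
    "hard_power_off": "01:34 AM",
    "blade_reseated": "01:36 AM",
    "network_responding": "01:42 AM",
}

def minutes_between(start, end):
    """Whole minutes from start to end, both on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

outage = minutes_between(events["shutdown_requested"], events["network_responding"])
print(f"shutdown to network response: {outage} minutes")
```

By this count, 19 minutes passed between the shutdown and the blade responding on the network again.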

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Categories

(Cloud Services :: Operations: Miscellaneous, task)