925033 - failover socorro2

Reporter

Description

•

12 years ago

We need to take socorro2 out of the load balancer for bugs 913488 and 887462. We should probably fail over so that socorro1 is the master and socorro3 is the receiver. socorro3 is currently the receiver, so it would have to be reconfigured to receive from socorro1. (I'm open to making socorro3 the master, since it's already the receiver in production, if that makes more sense than putting socorro1 directly in as the master.)

Sheeri Cabral [:sheeri]

Reporter

Comment 1

•

12 years ago

We would like to do this Tuesday 10/15 if possible, but of course that's up to you.

Flags: needinfo?(laura)

Sheeri Cabral [:sheeri]

Reporter

Comment 2

•

12 years ago

Let's update https://mana.mozilla.org/wiki/display/websites/crash-stats.mozilla.com+master+database+failover with the change process before the date.

Sheeri Cabral [:sheeri]

Reporter

Comment 3

•

12 years ago

That date's not going to work out; we'll need CAB approval. Laura, please give us a date that works for you (I know there's some PTO coming up) and we'll file a CAB bug.

Sheeri Cabral [:sheeri]

Reporter

Updated

•

12 years ago

Depends on: 926539

Sheeri Cabral [:sheeri]

Reporter

Comment 4

•

12 years ago

Failover is set for Tuesday, Oct 22nd, after 5 pm Pacific. Notes from last time: https://etherpad.mozilla.org/failover-prep

Matt Pressman [:mpressman]

Comment 5

•

12 years ago

Updated the etherpad: removed notes from the previous failover regarding disk upgrades, and changed hostnames to reflect failover from tp-socorro01-master02.

Matt Pressman [:mpressman]

Comment 6

•

12 years ago

failover reverted before enabling any services or repointing the load balancer

Matt Pressman [:mpressman]

Comment 7

•

12 years ago

Rebuild of socorro-reporting1.db.phx1 (previously replicated off of socorro3.db.phx1) is complete.

Sheeri Cabral [:sheeri]

Reporter

Comment 8

•

12 years ago

The timeline:

16:19 < mpressman> FAILOVER warning notificaiton commencing at 5:15pm
17:08 < mpressman> solarce: https://etherpad.mozilla.org/failover-prep
17:09 < mpressman> it's basically hardhat -> stop services -> failover -> switch zeus -> test
17:16 < solarce> mpressman: done
17:17 < solarce> mpressman: with hardhat
17:17 < mpressman> solarce: sweet, sending a notice email now
17:17 < solarce> mpressman: gonna need a few to stop and downtime stuff
17:17 < mpressman> sure thing
17:20 < mpressman> FAILOVER MAINTENANCE NOTICE - Commencing now
17:25 < solarce> mpressman: done
17:26 < mpressman> solarce: sweet! I'm running the failover now
17:28 < solarce> rhelmer: can you spot check the collectors?
17:28 < rhelmer> solarce: ok, what are you up to? :)
17:28 < solarce> rhelmer: failing over master pgsql
17:28 < rhelmer> ah
17:29 < solarce> i have mware, django, and processors stopped
17:29 < solarce> just nervous ;)
17:29 < rhelmer> solarce: hmm so collectors wouldn't know anything about postgres right?
17:29 < rhelmer> i will check anyway
17:29 < solarce> rhelmer:
17:29 < solarce> rhelmer: no, i am just nervous
17:29 < rhelmer> they *should* be saving to disk, then crashmover sends to hbase and puts crashid in rabbitmq
17:29 < solarce> rhelmer: the one i am watching is saying it has nothing to do
17:29 < rhelmer> will check anyway :)
17:30 < rhelmer> solarce: there's been a big drop in network activity, that seems unexpected no? http://sp-admin01.phx1.mozilla.com/ganglia/graph_all_periods.php?c=Socorro%20Collectors&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1382488219&g=network_report&z=large&c=Socorro%20Collectors
17:31 < rhelmer> solarce: I do see crashes coming in to sp-collector01
17:32 < rhelmer> solarce: coming in via apache and then going out to hbase+rabbit, that seems fine, just drop on that graph is odd
17:33 < rhelmer> solarce: we have a normal rise and fall as we go in and out of peak times, but sharp drops are unusual
17:33 < rhelmer> solarce: looks like it just bounced back
17:33 < rhelmer> solarce: did you do anything? :)
17:34 < solarce> rhelmer: yes, instead of hardhatting crash-stats http and https i hard hatted crash-reports https and crash-stats https, i turned the crash-reports hardhat off
17:35 < rhelmer> solarce: ah ok yeah don't do that :P
17:35 < solarce> rhelmer: brb getting katana
17:35 < rhelmer> lol
17:36 < solarce> mpressman: how's it going?
17:36 < mpressman> solarce: almost there
17:38 < mpressman> solarce: ok, good to go
17:39 < solarce> mpressman: i am confused

Sheeri Cabral [:sheeri]

Reporter

Comment 9

•

12 years ago

oof, part 2 - meant to edit some of that stuff out...

17:40 < solarce> mpressman: nm
17:45 < solarce> mpressman: all db config changes in zeus done
17:46 < solarce> mpressman: i see two config lines that point to 101
17:46 < solarce> mpressman: those should go to 110 now?
17:47 < mpressman> solarce: um, no, I think we should revert :( I'm seeing errors in the postgres log for the new master
17:47 < mpressman> solarce: so I don't want the apps to write to it
17:47 < solarce> mpressman: ok
17:48 < mpressman> solarce: let's repoint to 101 as the rw and ro
17:48 < solarce> mpressman: rw switched back, what about ro?
17:48 < solarce> ok, moving ro to 101 too
17:48 < mpressman> solarce: 101
17:48 < solarce> mpressman: everything is one 101
17:48 < solarce> on*
17:49 < mpressman> solarce: thank you, I'm gonna check that's working
17:57 < mpressman> solarce: one final step before we can enable the services, give me 2 mins
18:02 < mpressman> solarce: ok, we should be good to enable the services
18:03 < mpressman> master02 logs look good, I'll rebuild the other hosts
18:03 < solarce> mpressman: ok
18:04 < solarce> mpressman: i fired up 1 processor and it seems happy
18:04 < mpressman> good deal
18:04 < mpressman> I'm seeing connections .... and so on.

Looks like the problem was errors in the new master's postgres logs.
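For reference, the sequence in the log (hardhat -> stop services -> failover -> switch zeus -> test, then a revert once the new master's log showed errors) can be sketched as an ordered list of steps with undo actions. This is only an illustrative sketch, not the tooling that was actually used: the function names, the no-op actions, and the simulated failure are all assumptions.

```python
from typing import Callable, List, Tuple

# (name, action, undo action); every action here is a hypothetical placeholder
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, do, _undo = step
        try:
            do()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            # Revert in the opposite order to which the steps were applied
            for prev_name, _, prev_undo in reversed(done):
                prev_undo()
                log.append(f"reverted {prev_name}")
            return log
        done.append(step)
        log.append(f"done {name}")
    return log


def noop() -> None:
    """Stand-in for a real action (hardhat, stopping services, and so on)."""


def check_new_master_logs() -> None:
    # Simulated failure, standing in for "errors in the postgres log"
    raise RuntimeError("errors in the new master's postgres log")


steps: List[Step] = [
    ("hardhat", noop, noop),
    ("stop services", noop, noop),
    ("failover", noop, noop),
    ("switch zeus", noop, noop),
    ("test", check_new_master_logs, noop),
]

if __name__ == "__main__":
    for line in run_with_rollback(steps):
        print(line)
```

Keeping an undo action next to each step means the revert order does not have to be worked out mid-incident.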

Flags: needinfo?(laura)

Sheeri Cabral [:sheeri]

Reporter

Comment 10

•

12 years ago

The error in the logs was due to the variables in the wal_archive script, which ships log files.
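One way to catch this kind of problem before the next attempt would be a pre-flight check that the script's variables are actually set on the host being promoted. The sketch below is an assumption-heavy illustration: the variable names (ARCHIVE_HOST, ARCHIVE_DIR, ARCHIVE_USER) and the example script are hypothetical, since this bug does not record what the real wal_archive script contains or which variable was wrong.

```python
import re
from typing import Dict, Iterable, List

# Matches simple shell assignments such as NAME=value or export NAME="value"
ASSIGN = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_assignments(script_text: str) -> Dict[str, str]:
    """Collect top-level NAME=value assignments from shell script text."""
    values: Dict[str, str] = {}
    for line in script_text.splitlines():
        match = ASSIGN.match(line)
        if match:
            # Drop surrounding whitespace and quotes from the value
            values[match.group(1)] = match.group(2).strip().strip("\"'")
    return values


def missing_variables(script_text: str, required: Iterable[str]) -> List[str]:
    """Return the required names that are absent or assigned an empty value."""
    values = parse_assignments(script_text)
    return [name for name in required if not values.get(name)]


# Hypothetical script contents, for illustration only
EXAMPLE_SCRIPT = """#!/bin/sh
ARCHIVE_HOST="replica.example.com"
ARCHIVE_DIR=
cp "$1" "$ARCHIVE_DIR/$2"
"""

if __name__ == "__main__":
    required = ["ARCHIVE_HOST", "ARCHIVE_DIR", "ARCHIVE_USER"]
    print(missing_variables(EXAMPLE_SCRIPT, required))
```

This only checks that values are present; it does not verify that they point at the right host for the new topology, which would still need a manual review.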

Matt Pressman [:mpressman]

Comment 11

•

12 years ago

socorro3.db.phx1 rebuild complete

Sheeri Cabral [:sheeri]

Reporter

Comment 12

•

12 years ago

As discussed yesterday, we will be trying the failover again on Monday, Nov 4th after hours.

Selena Deckelmann :selenamarie :selena

Updated

•

12 years ago

Blocks: 823507

Matt Pressman [:mpressman]

Comment 13

•

12 years ago

Commencing failover now

Matt Pressman [:mpressman]

Comment 14

•

12 years ago

failover complete - socorro1.db.phx1 is now the primary master

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Sheeri Cabral [:sheeri]

Reporter

Updated

•

12 years ago

Depends on: 822685

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Data & BI Services Team

Categories

(Data & BI Services Team :: DB: MySQL, task)

Tracking

(Not tracked)

People

(Reporter: scabral, Unassigned)

Security

(public)
