Bug 925033 (Closed) - Opened 7 years ago, Closed 7 years ago
We need to take socorro2 out of the load balancer for bugs 913488 and 887462. We should probably fail over so that socorro1 is the master and socorro3 is the receiver - socorro3 is currently the receiver, so it would have to be reconfigured to receive from socorro1. (I'm open to making socorro3 the master, since it's already the receiver in production, if that makes more sense than putting socorro1 directly in as the master.)
We would like to do this Tuesday 10/15 if possible, but of course that's up to you.
Let's update https://mana.mozilla.org/wiki/display/websites/crash-stats.mozilla.com+master+database+failover with the change process before the date.
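For anyone following the mechanics: repointing the receiver boils down to rewriting its recovery.conf to name the new master and restarting Postgres. A minimal sketch of that step, assuming streaming replication with a recovery.conf (PostgreSQL pre-12 style); the data directory, port, and replication user below are assumptions, not taken from these hosts:

    #!/usr/bin/env python
    # Hypothetical sketch: repoint a standby (socorro3) to stream from the
    # new master (socorro1). Paths, port, and user are assumptions.
    import subprocess

    PGDATA = "/var/lib/pgsql/data"      # assumed data directory
    NEW_MASTER = "socorro1.db.phx1"     # replication source after failover

    recovery_conf = (
        "standby_mode = 'on'\n"
        "primary_conninfo = 'host=%s port=5432 user=replication'\n"
        % NEW_MASTER
    )
    with open(PGDATA + "/recovery.conf", "w") as f:
        f.write(recovery_conf)

    # Restart so the standby drops its old connection and dials the new master.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "restart", "-m", "fast"])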
That date's not going to work out; we'll need CAB approval. Laura, please give us a date that works for you (I know there's some PTO coming up) and we'll file a CAB bug.
Failover is set for Tuesday, Oct 22nd, after 5 pm Pacific. From last time: https://etherpad.mozilla.org/failover-prep
Updated the etherpad: removed notes from the previous failover regarding disk upgrades, and changed hostnames to reflect failover from tp-socorro01-master02.
Failover reverted before enabling any services or repointing the load balancer.
Rebuild of socorro-reporting1.db.phx1, which previously replicated off of socorro3.db.phx1, is complete.
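For the record, a rebuild like this is typically a fresh base backup taken from the current master; the bug doesn't say which tool was used, so the pg_basebackup sketch below (flags per PostgreSQL 9.2+) and its source host and paths are assumptions:

    #!/usr/bin/env python
    # Hypothetical standby rebuild via pg_basebackup. Source host, data
    # directory, and replication user are assumptions, not from this bug.
    import shutil, subprocess

    PGDATA = "/var/lib/pgsql/data"     # assumed data directory
    SOURCE = "socorro1.db.phx1"        # whichever host is the current master

    subprocess.call(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"])  # ok if already down
    shutil.rmtree(PGDATA)              # discard the stale data directory

    subprocess.check_call([
        "pg_basebackup", "-h", SOURCE, "-U", "replication",
        "-D", PGDATA, "-X", "stream", "-P",
    ])
    # Write recovery.conf (as in the repoint sketch above), then start.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "start"])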
The timeline:

16:19 < mpressman> FAILOVER warning notification commencing at 5:15pm
17:08 < mpressman> solarce: https://etherpad.mozilla.org/failover-prep
17:09 < mpressman> it's basically hardhat -> stop services -> failover -> switch zeus -> test
17:16 < solarce> mpressman: done
17:17 < solarce> mpressman: with hardhat
17:17 < mpressman> solarce: sweet, sending a notice email now
17:17 < solarce> mpressman: gonna need a few to stop and downtime stuff
17:17 < mpressman> sure thing
17:20 < mpressman> FAILOVER MAINTENANCE NOTICE - Commencing now
17:25 < solarce> mpressman: done
17:26 < mpressman> solarce: sweet! I'm running the failover now
17:28 < solarce> rhelmer: can you spot check the collectors?
17:28 < rhelmer> solarce: ok, what are you up to? :)
17:28 < solarce> rhelmer: failing over master pgsql
17:28 < rhelmer> ah
17:29 < solarce> i have mware, django, and processors stopped
17:29 < solarce> just nervous ;)
17:29 < rhelmer> solarce: hmm so collectors wouldn't know anything about postgres right?
17:29 < rhelmer> i will check anyway
17:29 < solarce> rhelmer:
17:29 < solarce> rhelmer: no, i am just nervous
17:29 < rhelmer> they *should* be saving to disk, then crashmover sends to hbase and puts crashid in rabbitmq
17:29 < solarce> rhelmer: the one i am watching is saying it has nothing to do
17:29 < rhelmer> will check anyway :)
17:30 < rhelmer> solarce: there's been a big drop in network activity, that seems unexpected no? http://sp-admin01.phx1.mozilla.com/ganglia/graph_all_periods.php?c=Socorro%20Collectors&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1382488219&g=network_report&z=large&c=Socorro%20Collectors
17:31 < rhelmer> solarce: I do see crashes coming in to sp-collector01
17:32 < rhelmer> solarce: coming in via apache and then going out to hbase+rabbit, that seems fine, just drop on that graph is odd
17:33 < rhelmer> solarce: we have a normal rise and fall as we go in and out of peak times, but sharp drops are unusual
17:33 < rhelmer> solarce: looks like it just bounced back
17:33 < rhelmer> solarce: did you do anything? :)
17:34 < solarce> rhelmer: yes, instead of hardhatting crash-stats http and https i hard hatted crash-reports https and crash-stats https, i turned the crash-reports hardhat off
17:35 < rhelmer> solarce: ah ok yeah don't do that :P
17:35 < solarce> rhelmer: brb getting katana
17:35 < rhelmer> lol
17:36 < solarce> mpressman: how's it going?
17:36 < mpressman> solarce: almost there
17:38 < mpressman> solarce: ok, good to go
17:39 < solarce> mpressman: i am confused
oof, part 2 - meant to edit some of that stuff out...

17:40 < solarce> mpressman: nm
17:45 < solarce> mpressman: all db config changes in zeus done
17:46 < solarce> mpressman: i see two config lines that point to 101
17:46 < solarce> mpressman: those should go to 110 now?
17:47 < mpressman> solarce: um, no, I think we should revert :( I'm seeing errors in the postgres log for the new master
17:47 < mpressman> solarce: so I don't want the apps to write to it
17:47 < solarce> mpressman: ok
17:48 < mpressman> solarce: let's repoint to 101 as the rw and ro
17:48 < solarce> mpressman: rw switched back, what about ro?
17:48 < solarce> ok, moving ro to 101 too
17:48 < mpressman> solarce: 101
17:48 < solarce> mpressman: everything is one 101
17:48 < solarce> on*
17:49 < mpressman> solarce: thank you, I'm gonna check that's working
17:57 < mpressman> solarce: one final step before we can enable the services, give me 2 mins
18:02 < mpressman> solarce: ok, we should be good to enable the services
18:03 < mpressman> master02 logs look good, I'll rebuild the other hosts
18:03 < solarce> mpressman: ok
18:04 < solarce> mpressman: i fired up 1 processor and it seems happy
18:04 < mpressman> good deal
18:04 < mpressman> I'm seeing connections .... and so on. Looks like the problem was errors in the new master's postgres logs.
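The revert above is exactly the failure mode a pre-repoint check guards against: make sure the failover target is really out of recovery and accepts writes before Zeus sends app traffic to it. A hedged sketch of such a check; the host, database name, and credentials are assumptions, not taken from this deployment:

    #!/usr/bin/env python
    # Hypothetical pre-repoint check, in the spirit of the transcript above:
    # confirm the target is out of recovery and takes a write before pointing
    # the load balancer at it. Host/dbname/user are assumptions.
    import psycopg2

    conn = psycopg2.connect(host="socorro1.db.phx1", dbname="breakpad",
                            user="postgres")
    cur = conn.cursor()

    cur.execute("SELECT pg_is_in_recovery()")
    assert cur.fetchone()[0] is False, "still a standby -- do not repoint the LB"

    # A throwaway write proves the node really takes traffic as a master;
    # on a hot standby this fails with a read-only-transaction error.
    cur.execute("CREATE TEMP TABLE failover_smoke (ok boolean)")
    cur.execute("INSERT INTO failover_smoke VALUES (true)")
    conn.rollback()  # leave nothing behind
    print("target looks writable; safe to repoint")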
The error in the logs was due to the variables in the wal_archive script that ships the WAL log files.
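In other words, archive_command was failing because the shipping script's variables no longer matched the topology, and Postgres logs (and retries) every failed archive attempt. A hypothetical wal_archive-style wrapper, just to show where such variables live; the script, hosts, and paths below are illustrative, not the actual script:

    #!/usr/bin/env python
    # Hypothetical wrapper invoked from postgresql.conf, e.g.:
    #   archive_command = '/usr/local/bin/wal_archive.py %p %f'
    # If DEST_HOST or DEST_DIR still point at the old topology, every archive
    # attempt fails and the master's log fills with errors like those above.
    import subprocess, sys

    DEST_HOST = "socorro3.db.phx1"     # must match the current standby
    DEST_DIR = "/pgdata/wal_archive"   # must exist on the destination

    wal_path, wal_name = sys.argv[1], sys.argv[2]   # %p and %f from Postgres
    rc = subprocess.call(["rsync", "-a", wal_path,
                          "%s:%s/%s" % (DEST_HOST, DEST_DIR, wal_name)])
    sys.exit(rc)   # non-zero tells Postgres the segment was NOT archived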
socorro3.db.phx1 rebuild complete
As discussed yesterday, we will be trying the failover again on Monday, Nov 4th after hours.
Commencing failover now
Failover complete - socorro1.db.phx1 is now the primary master.
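A quick way to confirm the standbys followed the new master is pg_stat_replication on socorro1. A sketch; column names match PostgreSQL 9.x (they were renamed to *_lsn in 10), and the credentials are assumptions:

    #!/usr/bin/env python
    # Hypothetical post-failover check on the new master: list connected
    # standbys and how far they have replayed. Host/dbname/user are assumptions.
    import psycopg2

    conn = psycopg2.connect(host="socorro1.db.phx1", dbname="breakpad",
                            user="postgres")
    cur = conn.cursor()
    cur.execute("""SELECT client_addr, state, sent_location, replay_location
                   FROM pg_stat_replication""")
    for addr, state, sent, replayed in cur.fetchall():
        # 'streaming' with sent/replay close together means a healthy standby.
        print(addr, state, sent, replayed)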
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team