Bug 925033 - failover socorro2
Opened: 12 years ago
Closed: 12 years ago
Status: RESOLVED FIXED
Categories: (Data & BI Services Team :: DB: MySQL, task)
Tracking: (Not tracked)
People: (Reporter: scabral, Unassigned)
We need to take socorro2 out of the load balancer for bugs 913488 and 887462.
We should probably fail over so that socorro1 is the master and socorro3 is the receiver
- socorro3 is currently the receiver, so it would have to be reconfigured to receive from socorro1.
(I'm open to making socorro3 the master, since it's already the receiver in production, if that makes more sense than putting socorro1 directly in as the master).
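For reference, a minimal sketch of what repointing socorro3's replication at the new master might look like, assuming PostgreSQL streaming replication (as the IRC log in comment 8 indicates); the hostname, port, user, and paths below are illustrative placeholders, not the actual production values:

    # recovery.conf on socorro3.db.phx1 (hypothetical values)
    standby_mode = 'on'
    # stream from the new master instead of the old one
    primary_conninfo = 'host=socorro1.db.phx1.mozilla.com port=5432 user=replication'
    # optional fallback to archived WAL if streaming falls behind
    restore_command = 'cp /pgdata/wal_archive/%f %p'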
Comment 1•12 years ago (Reporter)
We would like to do this Tuesday 10/15 if possible, but of course that's up to you.
Flags: needinfo?(laura)
Comment 2•12 years ago (Reporter)
Let's update https://mana.mozilla.org/wiki/display/websites/crash-stats.mozilla.com+master+database+failover with the change process before the date.
Comment 3•12 years ago (Reporter)
That date's not going to work out; we'll need CAB approval. Laura, please give us a date that works for you (I know there's some PTO coming up) and we'll file a CAB bug.
Comment 4•12 years ago (Reporter)
Failover is set for Tuesday Oct 22nd after 5 pm Pacific
From last time - https://etherpad.mozilla.org/failover-prep
Comment 5•12 years ago
Updated the etherpad: removed notes from the previous failover regarding disk upgrades, and changed hostnames to reflect failing over from tp-socorro01-master02.
Comment 6•12 years ago
Failover was reverted before any services were enabled or the load balancer was repointed.
Comment 7•12 years ago
Rebuild of socorro-reporting1.db.phx1 (previously replicated off of socorro3.db.phx1) is complete.
Comment 8•12 years ago (Reporter)
The timeline:
16:19 < mpressman> FAILOVER warning notification commencing at 5:15pm
17:08 < mpressman> solarce: https://etherpad.mozilla.org/failover-prep
17:09 < mpressman> it's basically hardhat -> stop services -> failover -> switch zeus -> test
17:16 < solarce> mpressman: done
17:17 < solarce> mpressman: with hardhat
17:17 < mpressman> solarce: sweet, sending a notice email now
17:17 < solarce> mpressman: gonna need a few to stop and downtime stuff
17:17 < mpressman> sure thing
17:20 < mpressman> FAILOVER MAINTENANCE NOTICE - Commencing now
17:25 < solarce> mpressman: done
17:26 < mpressman> solarce: sweet! I'm running the failover now
17:28 < solarce> rhelmer: can you spot check the collectors?
17:28 < rhelmer> solarce: ok, what are you up to? :)
17:28 < solarce> rhelmer: failing over master pgsql
17:28 < rhelmer> ah
17:29 < solarce> i have mware, django, and processors stopped
17:29 < solarce> just nervous ;)
17:29 < rhelmer> solarce: hmm so collectors wouldn't know anything about postgres right?
17:29 < rhelmer> i will check anyway
17:29 < solarce> rhelmer:
17:29 < solarce> rhelmer: no, i am just nervous
17:29 < rhelmer> they *should* be saving to disk, then crashmover sends to hbase and puts crashid in rabbitmq
17:29 < solarce> rhelmer: the one i am watching is saying it has nothing to do
17:29 < rhelmer> will check anyway :)
17:30 < rhelmer> solarce: there's been a big drop in network activity, that seems unexpected no? http://sp-admin01.phx1.mozilla.com/ganglia/graph_all_periods.php?c=Socorro%20Collectors&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1382488219&g=network_report&z=large&c=Socorro%20Collectors
17:31 < rhelmer> solarce: I do see crashes coming in to sp-collector01
17:32 < rhelmer> solarce: coming in via apache and then going out to hbase+rabbit, that seems fine, just drop on that graph is odd
17:33 < rhelmer> solarce: we have a normal rise and fall as we go in and out of peak times, but sharp drops are unusual
17:33 < rhelmer> solarce: looks like it just bounced back
17:33 < rhelmer> solarce: did you do anything? :)
17:34 < solarce> rhelmer: yes, instead of hardhatting crash-stats http and https i hard hatted crash-reports https and crash-stats https, i turned the crash-reports hardhat off
17:35 < rhelmer> solarce: ah ok yeah don't do that :P
17:35 < solarce> rhelmer: brb getting katana
17:35 < rhelmer> lol
17:36 < solarce> mpressman: how's it going?
17:36 < mpressman> solarce: almost there
17:38 < mpressman> solarce: ok, good to go
17:39 < solarce> mpressman: i am confused
Comment 9•12 years ago (Reporter)
oof, part 2 - meant to edit some of that stuff out...
17:40 < solarce> mpressman: nm
17:45 < solarce> mpressman: all db config changes in zeus done
17:46 < solarce> mpressman: i see two config lines that point to 101
17:46 < solarce> mpressman: those should go to 110 now?
17:47 < mpressman> solarce: um, no, I think we should revert :( I'm seeing errors in the postgres log for the new master
17:47 < mpressman> solarce: so I don't want the apps to write to it
17:47 < solarce> mpressman: ok
17:48 < mpressman> solarce: let's repoint to 101 as the rw and ro
17:48 < solarce> mpressman: rw switched back, what about ro?
17:48 < solarce> ok, moving ro to 101 too
17:48 < mpressman> solarce: 101
17:48 < solarce> mpressman: everything is one 101
17:48 < solarce> on*
17:49 < mpressman> solarce: thank you, I'm gonna check that's working
17:57 < mpressman> solarce: one final step before we can enable the services, give me 2 mins
18:02 < mpressman> solarce: ok, we should be good to enable the services
18:03 < mpressman> master02 logs look good, I'll rebuild the other hosts
18:03 < solarce> mpressman: ok
18:04 < solarce> mpressman: i fired up 1 processor and it seems happy
18:04 < mpressman> good deal
18:04 < mpressman> I'm seeing connections
.... and so on.
Looks like the problem was errors in the new master's postgres logs.
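For context, the failover step itself on a PostgreSQL standby of that era is roughly a promote-and-verify sequence before repointing the apps; a hedged sketch (the data directory and log paths are illustrative, not the production layout):

    # promote the standby (the new master) to read/write
    pg_ctl -D /var/lib/pgsql/9.2/data promote
    # confirm it has left recovery mode (should return 'f')
    psql -c "SELECT pg_is_in_recovery();"
    # watch the new master's log for errors before enabling services
    tail -f /var/lib/pgsql/9.2/data/pg_log/postgresql-*.log

In this run, that log check is where the errors showed up, which is why everything was repointed back to 101 before any services were re-enabled.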
Flags: needinfo?(laura)
Comment 10•12 years ago (Reporter)
The error in the logs was due to the variables in the wal_archive script that ships the WAL log files.
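A rough sketch of the kind of WAL-shipping wrapper involved; the variable names, destination host, and paths here are illustrative, not the actual wal_archive script:

    #!/bin/bash
    # invoked by postgres via something like:
    #   archive_command = '/usr/local/bin/wal_archive %p %f'
    WAL_PATH="$1"    # %p: path to the finished WAL segment
    WAL_FILE="$2"    # %f: bare WAL segment file name
    DEST_HOST="socorro3.db.phx1.mozilla.com"    # assumed destination, not verified
    DEST_DIR="/pgdata/wal_archive"
    # if these variables still describe the pre-failover topology, archiving
    # fails and the errors land in the new master's postgres log
    rsync -a "$WAL_PATH" "${DEST_HOST}:${DEST_DIR}/${WAL_FILE}"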
Comment 11•12 years ago
socorro3.db.phx1 rebuild complete
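For reference, a common way to rebuild a streaming replica against the current master on a 9.2-era PostgreSQL is pg_basebackup; a rough sketch, not necessarily the exact procedure used here (the master hostname and paths are placeholders):

    # on the replica being rebuilt, with postgres stopped and the old data dir cleared
    rm -rf /var/lib/pgsql/9.2/data/*
    pg_basebackup -h MASTER_HOST -U replication \
        -D /var/lib/pgsql/9.2/data -X stream -P
    # drop in a recovery.conf pointing at the master, then start postgres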
Comment 12•12 years ago (Reporter)
As discussed yesterday, we will be trying the failover again on Monday, Nov 4th after hours.
Comment 13•12 years ago
Commencing failover now
Comment 14•12 years ago
failover complete - socorro1.db.phx1 is now the primary master
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Data & BI Services Team