Bug 925033 (Closed) - Opened 7 years ago, Closed 7 years ago
We need to take socorro2 out of the load balancer for bugs 913488 and 887462. We should probably fail over so that socorro1 is the master and socorro3 is the receiver - socorro3 is currently the receiver, so it would have to be reconfigured to receive from socorro1. (I'm open to making socorro3 the master, since it's already the receiver in production, if that makes more sense than putting socorro1 directly in as the master.)
We would like to do this Tuesday 10/15 if possible, but of course that's up to you.
Let's update https://mana.mozilla.org/wiki/display/websites/crash-stats.mozilla.com+master+database+failover with the change process before the date.
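For anyone following the mechanics: repointing the receiver boils down to rewriting its recovery.conf to name the new master and restarting Postgres. A minimal sketch of that step, assuming streaming replication with a recovery.conf (PostgreSQL pre-12 style); the data directory, port, and replication user below are assumptions, not taken from these hosts:

    #!/usr/bin/env python
    # Hypothetical sketch: repoint a standby (socorro3) to stream from the
    # new master (socorro1). Paths, port, and user are assumptions.
    import subprocess

    PGDATA = "/var/lib/pgsql/data"      # assumed data directory
    NEW_MASTER = "socorro1.db.phx1"     # replication source after failover

    recovery_conf = (
        "standby_mode = 'on'\n"
        "primary_conninfo = 'host=%s port=5432 user=replication'\n"
        % NEW_MASTER
    )
    with open(PGDATA + "/recovery.conf", "w") as f:
        f.write(recovery_conf)

    # Restart so the standby drops its old connection and dials the new master.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "restart", "-m", "fast"])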
That date's not going to work out; we'll need CAB approval. Laura, please give us a date that works for you (I know there's some PTO coming up) and we'll file a CAB bug.
Failover is set for Tuesday, Oct 22nd, after 5 pm Pacific. From last time: https://etherpad.mozilla.org/failover-prep
Updated the etherpad: removed notes from the previous failover regarding disk upgrades, and changed hostnames to reflect failover from tp-socorro01-master02.
Failover reverted before enabling any services or repointing the load balancer.
Rebuild of socorro-reporting1.db.phx1, which previously replicated off of socorro3.db.phx1, is complete.
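For the record, a rebuild like this is typically a fresh base backup taken from the current master; the bug doesn't say which tool was used, so the pg_basebackup sketch below (flags per PostgreSQL 9.2+) and its source host and paths are assumptions:

    #!/usr/bin/env python
    # Hypothetical standby rebuild via pg_basebackup. Source host, data
    # directory, and replication user are assumptions, not from this bug.
    import shutil, subprocess

    PGDATA = "/var/lib/pgsql/data"     # assumed data directory
    SOURCE = "socorro1.db.phx1"        # whichever host is the current master

    subprocess.call(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"])  # ok if already down
    shutil.rmtree(PGDATA)              # discard the stale data directory

    subprocess.check_call([
        "pg_basebackup", "-h", SOURCE, "-U", "replication",
        "-D", PGDATA, "-X", "stream", "-P",
    ])
    # Write recovery.conf (as in the repoint sketch above), then start.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "start"])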
The timeline:

16:19 < mpressman> FAILOVER warning notification commencing at 5:15pm
17:08 < mpressman> solarce: https://etherpad.mozilla.org/failover-prep
17:09 < mpressman> it's basically hardhat -> stop services -> failover -> switch zeus -> test
17:16 < solarce> mpressman: done
17:17 < solarce> mpressman: with hardhat
17:17 < mpressman> solarce: sweet, sending a notice email now
17:17 < solarce> mpressman: gonna need a few to stop and downtime stuff
17:17 < mpressman> sure thing
17:20 < mpressman> FAILOVER MAINTENANCE NOTICE - Commencing now
17:25 < solarce> mpressman: done
17:26 < mpressman> solarce: sweet! I'm running the failover now
17:28 < solarce> rhelmer: can you spot check the collectors?
17:28 < rhelmer> solarce: ok, what are you up to? :)
17:28 < solarce> rhelmer: failing over master pgsql
17:28 < rhelmer> ah
17:29 < solarce> i have mware, django, and processors stopped
17:29 < solarce> just nervous ;)
17:29 < rhelmer> solarce: hmm so collectors wouldn't know anything about postgres right?
17:29 < rhelmer> i will check anyway
17:29 < solarce> rhelmer:
17:29 < solarce> rhelmer: no, i am just nervous
17:29 < rhelmer> they *should* be saving to disk, then crashmover sends to hbase and puts crashid in rabbitmq
17:29 < solarce> rhelmer: the one i am watching is saying it has nothing to do
17:29 < rhelmer> will check anyway :)
17:30 < rhelmer> solarce: there's been a big drop in network activity, that seems unexpected no? http://sp-admin01.phx1.mozilla.com/ganglia/graph_all_periods.php?c=Socorro%20Collectors&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1382488219&g=network_report&z=large&c=Socorro%20Collectors
17:31 < rhelmer> solarce: I do see crashes coming in to sp-collector01
17:32 < rhelmer> solarce: coming in via apache and then going out to hbase+rabbit, that seems fine, just drop on that graph is odd
17:33 < rhelmer> solarce: we have a normal rise and fall as we go in and out of peak times, but sharp drops are unusual
17:33 < rhelmer> solarce: looks like it just bounced back
17:33 < rhelmer> solarce: did you do anything? :)
17:34 < solarce> rhelmer: yes, instead of hardhatting crash-stats http and https i hard hatted crash-reports https and crash-stats https, i turned the crash-reports hardhat off
17:35 < rhelmer> solarce: ah ok yeah don't do that :P
17:35 < solarce> rhelmer: brb getting katana
17:35 < rhelmer> lol
17:36 < solarce> mpressman: how's it going?
17:36 < mpressman> solarce: almost there
17:38 < mpressman> solarce: ok, good to go
17:39 < solarce> mpressman: i am confused
oof, part 2 - meant to edit some of that stuff out...

17:40 < solarce> mpressman: nm
17:45 < solarce> mpressman: all db config changes in zeus done
17:46 < solarce> mpressman: i see two config lines that point to 101
17:46 < solarce> mpressman: those should go to 110 now?
17:47 < mpressman> solarce: um, no, I think we should revert :( I'm seeing errors in the postgres log for the new master
17:47 < mpressman> solarce: so I don't want the apps to write to it
17:47 < solarce> mpressman: ok
17:48 < mpressman> solarce: let's repoint to 101 as the rw and ro
17:48 < solarce> mpressman: rw switched back, what about ro?
17:48 < solarce> ok, moving ro to 101 too
17:48 < mpressman> solarce: 101
17:48 < solarce> mpressman: everything is one 101
17:48 < solarce> on*
17:49 < mpressman> solarce: thank you, I'm gonna check that's working
17:57 < mpressman> solarce: one final step before we can enable the services, give me 2 mins
18:02 < mpressman> solarce: ok, we should be good to enable the services
18:03 < mpressman> master02 logs look good, I'll rebuild the other hosts
18:03 < solarce> mpressman: ok
18:04 < solarce> mpressman: i fired up 1 processor and it seems happy
18:04 < mpressman> good deal
18:04 < mpressman> I'm seeing connections .... and so on. Looks like the problem was errors in the new master's postgres logs.
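The revert above is exactly the failure mode a pre-repoint check guards against: make sure the failover target is really out of recovery and accepts writes before Zeus sends app traffic to it. A hedged sketch of such a check; the host, database name, and credentials are assumptions, not taken from this deployment:

    #!/usr/bin/env python
    # Hypothetical pre-repoint check, in the spirit of the transcript above:
    # confirm the target is out of recovery and takes a write before pointing
    # the load balancer at it. Host/dbname/user are assumptions.
    import psycopg2

    conn = psycopg2.connect(host="socorro1.db.phx1", dbname="breakpad",
                            user="postgres")
    cur = conn.cursor()

    cur.execute("SELECT pg_is_in_recovery()")
    assert cur.fetchone()[0] is False, "still a standby -- do not repoint the LB"

    # A throwaway write proves the node really takes traffic as a master;
    # on a hot standby this fails with a read-only-transaction error.
    cur.execute("CREATE TEMP TABLE failover_smoke (ok boolean)")
    cur.execute("INSERT INTO failover_smoke VALUES (true)")
    conn.rollback()  # leave nothing behind
    print("target looks writable; safe to repoint")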
The error in the logs was due to the variables in the wal_archive script that ships the WAL log files.
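In other words, archive_command was failing because the shipping script's variables no longer matched the topology, and Postgres logs (and retries) every failed archive attempt. A hypothetical wal_archive-style wrapper, just to show where such variables live; the script, hosts, and paths below are illustrative, not the actual script:

    #!/usr/bin/env python
    # Hypothetical wrapper invoked from postgresql.conf, e.g.:
    #   archive_command = '/usr/local/bin/wal_archive.py %p %f'
    # If DEST_HOST or DEST_DIR still point at the old topology, every archive
    # attempt fails and the master's log fills with errors like those above.
    import subprocess, sys

    DEST_HOST = "socorro3.db.phx1"     # must match the current standby
    DEST_DIR = "/pgdata/wal_archive"   # must exist on the destination

    wal_path, wal_name = sys.argv[1], sys.argv[2]   # %p and %f from Postgres
    rc = subprocess.call(["rsync", "-a", wal_path,
                          "%s:%s/%s" % (DEST_HOST, DEST_DIR, wal_name)])
    sys.exit(rc)   # non-zero tells Postgres the segment was NOT archived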
socorro3.db.phx1 rebuild complete
As discussed yesterday, we will be trying the failover again on Monday, Nov 4th after hours.
Commencing failover now
Failover complete - socorro1.db.phx1 is now the primary master.
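A quick way to confirm the standbys followed the new master is pg_stat_replication on socorro1. A sketch; column names match PostgreSQL 9.x (they were renamed to *_lsn in 10), and the credentials are assumptions:

    #!/usr/bin/env python
    # Hypothetical post-failover check on the new master: list connected
    # standbys and how far they have replayed. Host/dbname/user are assumptions.
    import psycopg2

    conn = psycopg2.connect(host="socorro1.db.phx1", dbname="breakpad",
                            user="postgres")
    cur = conn.cursor()
    cur.execute("""SELECT client_addr, state, sent_location, replay_location
                   FROM pg_stat_replication""")
    for addr, state, sent, replayed in cur.fetchall():
        # 'streaming' with sent/replay close together means a healthy standby.
        print(addr, state, sent, replayed)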
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team