Closed Bug 683446 Opened 13 years ago Closed 12 years ago

Add PHX instance of aus3.m.o to production

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P2)

x86
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: cshields)

Attachments

(1 file)

Over in bug 658066 the setup of aus3.m.o in PHX was done and verified, but we didn't take the last step and enable it in production. We had multiple-colo coverage for aus2.m.o, but no longer do now that it redirects to aus3.m.o.
Assignee: server-ops → jeremy.orem+bugs
aus3.3crowd.mozilla.net is set up and ready to go. The final step will be to CNAME aus3.mozilla.org there. 


We are currently experiencing some problems with load in phx, so I don't want to send more traffic there right now.
I'll be out on paternity leave before we can make this change.

Corey, once PHX has stabilized can you flip DNS for aus3 to aus3.3crowd.mozilla.net or assign this out?
Assignee: jeremy.orem+bugs → cshields
Well, I thought we had the phx issues stabilized last Thursday, but today when logging in to Zeus I noticed some possible issues. Still looking into this.
Component: Server Operations → Server Operations: Web Operations
QA Contact: mrz → cshields
nthomas, has aus3 in phx1 been tested/vetted?  I know we had to do this for aus2, just want to make sure we are okay to do this before throwing the switch.
Corey, I did at bug 658066 comment #12. I'll rerun those and add some updated checks.
I ran some tests shortly after comment #5 and hit some issues, but they might just be related to my test rig. Dustin says we need to move to PHX1 to exit SJC1, and that may help with the load on mpt-netapp-b too.
Assignee: cshields → nrthomas
Priority: -- → P2
For the record, we just need to move out of sjc1 -- if that's to scl3 and not phx1, that's fine (but can't happen today, obviously)
nthomas, any update here?
My week got kinda blown out of the water by the chemspills, but I'll look this coming week.
Corey will need to schedule the dist cluster in sjc1 onto a train into scl3.  The next three available trains are 3/12, 3/26, and 4/9.  The cutoff to get on the first is this Friday at noon PST.  Do you think we could be certain by Friday that this will work by 3/12?
Depends on: 729859
I should have results for the full set of tests in a few hours, but initial results show that aus03.zlb.phx.mozilla.net is working fairly well in comparison to current prod of aus3.m.o (aka aus3.acelb.sj.mozilla.com). 

The exceptions are
* pp-app-dist07 always returns an empty update rather than content, while several other pp-app-distNN machines are working fine. For a test URL you could use
 https://aus3.mozilla.org/update/1/Firefox/10.0.1/20120208060813/WINNT_x86-msvc/en-US/release/update.xml
* some snippets are missing in PHX1 (RelEng to fix in bug 729859). That shouldn't be hard to resolve.
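
As a rough illustration of the pp-app-dist07 symptom: an "empty update" can be told apart from a real one by checking whether the returned XML has any `<update>` element under `<updates>`. The Python sketch below assumes that response shape; the function name and both sample payloads are made up for illustration and are not taken from the attached tests.

```python
import xml.etree.ElementTree as ET


def is_empty_update(xml_text: str) -> bool:
    """Return True when the response has no <update> child under the root."""
    root = ET.fromstring(xml_text)
    return root.find("update") is None


# Hypothetical payloads, for illustration only (not captured from aus3).
EMPTY = '<?xml version="1.0"?><updates></updates>'
WITH_UPDATE = (
    '<?xml version="1.0"?><updates>'
    '<update type="minor" version="11.0">'
    '<patch type="complete" URL="http://example.invalid/x.mar"/>'
    "</update></updates>"
)

if __name__ == "__main__":
    print(is_empty_update(EMPTY))
    print(is_empty_update(WITH_UPDATE))
```

Running the same check against each pp-app-distNN machine individually would be one way to spot a single host that has drifted from the others.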

Given we had aus2.m.o setup in PHX1 previously, and I'm getting a lot of PASS results for the tests that have run so far, I think we can be confident of hitting the first train.
pp-app-dist07 might just have a stale config.php (copied from config-dist.php). I think it must have rev 1.186 or older of 
 http://bonsai.mozilla.org/cvsview2.cgi?subdir=mozilla/webtools/aus/xml/inc&files=config-dist.php&command=DIRECTORY&branch=HEAD&root=/cvsroot
A quick look at the configs on 07 and 06 confirms your suspicion.  I'm not sure what the "approved" way to fix this is (or why it went wrong) -- Jake?
This is fixed. The update job was hung, and had been for ~192 days, it seems. I killed it and it runs properly now. This should be working again.

Not resolving this bug because I'm not sure what all is left to do on it.
Thanks, Jake!  We'll see how Nick's tests finish out.
Attached file tests
pp-app-dist07 looks good now, thanks. Are there nagios checks (or equivalent) to make sure hung update jobs are noticed ?

I reran the test suite with some updated tests and it all passed, except for 5 requests which came back with no updates from sjc1 and updates from phx1. That wasn't repeatable. The tests are attached here - they're specifically checking updates served for recent and some end-of-branch releases, SSL certificates, and blocking of deprecated OS & hardware.

Since the tests only hit a small subset of the versions in the wild, I've confirmed that the same snippets are present in both locations (using a set of rsyncs).

I've also checked that Firefox is happy connecting to Phoenix (by forcing the IP using /etc/hosts), so the CA pinning for https has some extra confirmation.

All the testing was against aus03.zlb.phx.mozilla.net/63.245.217.44

--> signing off on bringing phx1 online for aus3.m.o
Assignee: nrthomas → cshields
Blocks: 728441
No longer depends on: 728441
Jake suggests this is as simple as jeremy suggested in comment 1:

-aus3            60 IN A     63.245.209.149
+aus3            60 IN CNAME aus3.3crowd.mozilla.net

(with SOA update of course).  I can certainly do this, and can undo it just as quickly if things go wrong.  Should I pull the trigger tomorrow (Thursday) morning, or is there more to it?
Whoa no, don't do that. It's not set up in 3crowd anymore... that was months ago, and we've mostly migrated away from that.

If we're not trying to multi-host this just yet, then just change the A record to 63.245.217.44 per comment 16. This seems most likely to me since we're about to vacate SJC1. Once SCL3 is online, we can look into multi-hosting it. :)
Is there any merit in using both DCs as a transition given the issues with firewalls in PHX1 ? What's the current state of that issue ?
We're not currently set up for cross-site load balancing, but this change is the sort of thing that could be reverted quickly if necessary.  The phx1 issues are well-handled and -isolated at this point, and we're moving a number of other services to phx1, so I don't think this is a big concern.  Jake, please add/correct if I've missed something.
Having not built this nor really knowing how it works, I'm not comfortable making this change.  That said, we should get this done soon so we can ship the sjc1 dist cluster to scl3.
Before we change this, we need to find out what "aus3.acelb.sj.mozilla.com" is, I think. They both go to the same IP (63.245.209.149).

external/mozilla.org-update:aus3       60  IN  A      63.245.209.149
external/sj.mozilla.com:aus3.acelb.sj      IN  A      63.245.209.149

This might be a vestigial record left over from some previous configuration, no longer used... not entirely sure. I don't see a similar record for any other datacenter... possibly this was created some time ago when aus3 first started to get spun up in PHX1, for the purpose of having a way to distinguish the 2 locations if necessary? If that's the case, it probably should be removed. But if it's important for something, then it should be changed to point to the new PHX1 IP (or CNAME).

Metrics has also requested that the change be made after 1600 PST, if possible... this is 0000 UTC, and so gives them a full day to correct any data collection problems before the daily reports are created.
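
For anyone checking the time arithmetic in the request above, here is a small Python sketch. It uses fixed UTC offsets (PST = UTC-8, PDT = UTC-7) rather than a timezone database, and the two calendar dates are placeholders picked for illustration, not the scheduled change date.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets; which one applies depends on whether daylight time is in effect.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware local time to UTC."""
    return local.astimezone(timezone.utc)


# Placeholder dates, for illustration only.
standard_time = datetime(2012, 1, 9, 16, 0, tzinfo=PST)  # 1600 PST
daylight_time = datetime(2012, 4, 9, 17, 0, tzinfo=PDT)  # 1700 PDT

if __name__ == "__main__":
    print(to_utc(standard_time).isoformat())
    print(to_utc(daylight_time).isoformat())
```

Both inputs land on midnight UTC of the following day, so during daylight time the equivalent local cutoff is 1700 rather than 1600.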
I don't know why there isn't a CNAME from aus3.m.o to aus3.acelb.sj.mozilla.com instead of another A record, but that really falls into whatever rules you guys have about how DNS should be set up.
I think we've polled all of the people who could potentially anticipate any problems here.  I'll pull the trigger on Monday after 16:00 pacific, after making sure nthomas is around to double-check things.
This is one of those do-it-calmly-now - or - firedrill-it-later kinds of things.  And we will have no shortage of firedrills.
(In reply to Dustin J. Mitchell [:dustin] from comment #24)
> I think we've polled all of the people who could potentially anticipate any
> problems here.  I'll pull the trigger on Monday after 16:00 pacific, after
> making sure nthomas is around to double-check things.

So, the original suggested timing here still makes sense, i.e. Monday 16:00 PST. This type of change would need to happen on a weekend or a Monday to avoid impact to releases.

Would we want to try this next Monday, April 9th, assuming Nick is around to help on the releng side?
Do we need to make a hard switch, or add a CNAME from aus3.m.o to aus3.acelb.sj.mozilla.com as Nick suggested in comment 23? A hard switch over sounds scary :)
That would really depend on the timescale for turning off pm-app-distNN, which are currently serving aus3.m.o out of SJC1. cshields ?
Idea: we could move aus3.m.o to Cedexis or 3crowd, and do a phased cut-over to PHX1. We have a weighted-round-robin script already good-to-go for Cedexis... It's a simple matter to start with, say, 10% of load, make sure everyone is happy, and ramp it up. Once PHX1 is 100% of the load, we can simply revert it to a normal/simple DNS setup.

Typically the only concern with this situation is if anyone in Metrics will care about it. But that's a concern regardless of how we actually manage the cut-over.
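
To make the phased cut-over idea concrete, here is a minimal Python sketch of weighted selection between two sites. It is not the Cedexis script mentioned above (which is not shown in this bug); the site labels, weights, and function names are assumptions for illustration.

```python
import random
from collections import Counter
from typing import Mapping


def pick_site(weights: Mapping[str, int], rng: random.Random) -> str:
    """Pick one site with probability proportional to its weight."""
    sites = list(weights)
    return rng.choices(sites, weights=[weights[s] for s in sites], k=1)[0]


def simulate(weights: Mapping[str, int], requests: int, seed: int = 0) -> Counter:
    """Count how many of `requests` simulated lookups land on each site."""
    rng = random.Random(seed)
    return Counter(pick_site(weights, rng) for _ in range(requests))


if __name__ == "__main__":
    # Hypothetical ramp: start with about 10% on phx1, finish at 100%.
    for phx_share in (10, 50, 100):
        weights = {"sjc1": 100 - phx_share, "phx1": phx_share}
        print(phx_share, dict(simulate(weights, 10_000)))
```

In a DNS-based setup the split applies to resolver answers rather than individual requests, so the share of traffic actually observed also depends on record TTLs and resolver caching.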
A gentle transition sounds great to us in RelEng too. CC'ing Daniel from Metrics for his take.
(In reply to Jake Maul [:jakem] from comment #22)
> Metrics has also requested that the change be made after 1600 PST, if
> possible... this is 0000 UTC, and so gives them a full day to correct any
> data collection problems before the daily reports are created.

Daniel, given the above for a hard cutover from SJC1 to PHX1 for aus3.m.o traffic, would you ask for anything different if we're doing a gentle transition a la comment #29 ?
I've reverified the tests in comment #16, and that the data is identical in sjc1 and phx1. So this is RelEng signing off on sending ~10% of traffic to phx1.

Still pending
* go from metrics (not necessary if we change at 1700 PDT/0000 UTC as already agreed for a full cut over ?)
* set up 'round-robin' suggested in comment #29, test, deploy
Could you please provide a quick synopsis of what things will look like after the change so I can be sure we don't miss anything?  Here are the questions I'd like to make sure are clearly answered:

When will the change take effect?  Is there a period where we will go from A to AB to B or is it a straight cutoff?

Will any of the old servers be turned off?  The current active list is:
datacenter      site    server  basedir filename_prefix
sjc aus     zlb01-aus2      /mnt/stats_im-log03/stats/logs/zlb01.nms.mozilla.org    aus2.mozilla.org.access_
sjc aus     zlb02-aus2      /mnt/stats_im-log03/stats/logs/zlb02.nms.mozilla.org    aus2.mozilla.org.access_
sjc aus     zlb03-aus2      /mnt/stats_im-log03/stats/logs/zlb03.nms.mozilla.org    aus2.mozilla.org.access_
sjc aus     zlb01-aus3      /mnt/stats_im-log03/stats/logs/zlb01.nms.mozilla.org    aus3.mozilla.org.access_
sjc aus     zlb02-aus3      /mnt/stats_im-log03/stats/logs/zlb02.nms.mozilla.org    aus3.mozilla.org.access_
sjc aus     zlb03-aus3      /mnt/stats_im-log03/stats/logs/zlb03.nms.mozilla.org    aus3.mozilla.org.access_


What is the above info for the new servers that will be actively serving traffic?

Are there any servers that might automatically start serving traffic in the event of a server outage? (We consider these passive servers and we look for logs but don't require them.)
I don't know the answers to most of that, except that RelEng would like to do A -> AB -> B, for A=sjc1 and B=phx1. The sjc1 machines will have to get turned off when we evacuate that colo. I'm surprised that you don't have any aus2 in phx listed, we were using that for some time before redirecting traffic to aus3.

cshields/jakem for the rest ?
aus2.mozilla.org is indeed being served out of SJC1 and PHX1 right now... although as you mentioned, it simply redirects to aus3.mozilla.org.


(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #33)
> When will the change take effect? 

Soon, but I don't have an exact date. Dustin might have an answer for that.


> Is there a period where we will go from A to AB to B or is it a straight cutoff?

Sounds like RelEng has a healthy preference for the former, so we'll do that.


> Will any of the old servers be turned off?  The current active list is:
> datacenter      site    server  basedir
> filename_prefix
> sjc aus     zlb01-aus2     
> /mnt/stats_im-log03/stats/logs/zlb01.nms.mozilla.org   
> aus2.mozilla.org.access_
> sjc aus     zlb02-aus2     
> /mnt/stats_im-log03/stats/logs/zlb02.nms.mozilla.org   
> aus2.mozilla.org.access_
> sjc aus     zlb03-aus2     
> /mnt/stats_im-log03/stats/logs/zlb03.nms.mozilla.org   
> aus2.mozilla.org.access_
> sjc aus     zlb01-aus3     
> /mnt/stats_im-log03/stats/logs/zlb01.nms.mozilla.org   
> aus3.mozilla.org.access_
> sjc aus     zlb02-aus3     
> /mnt/stats_im-log03/stats/logs/zlb02.nms.mozilla.org   
> aus3.mozilla.org.access_
> sjc aus     zlb03-aus3     
> /mnt/stats_im-log03/stats/logs/zlb03.nms.mozilla.org   
> aus3.mozilla.org.access_

All of these are in SJC1... they will *all* be shut off, ultimately. There will be a transition period where these are in use simultaneously with the relevant PHX1 Zeus nodes (pp-zlb08.phx.mozilla.net - pp-zlb12.phx.mozilla.net).


> What is the above info for the new servers that will be actively serving
> traffic?

I don't have this info... at least, not yet. Dustin might, or we can look it up. I don't know much about the whole im-log02/03 metrics/stats system.

> Are there any servers that might automatically start serving traffic in the
> event of a server outage? (We consider these passive servers and we look for
> logs but don't require them.)

Yes- most likely, one of the 5 servers (pp-zlb08 through 12) will be active and the other 4 will be passive. I don't know which is (or will be) active and which will be passive, though.
(In reply to Jake Maul [:jakem] from comment #35)
> Soon, but I don't have an exact date. Dustin might have an answer for that.

ASAP - nothing's blocking that I know of.  The cluster this is running on (dist) isn't scheduled on a train, but will certainly find itself on one very soon, and is also tied up with the product distribution problem, which is thorny and likely to come down to the wire.

> > What is the above info for the new servers that will be actively serving
> > traffic?
> 
> I don't have this info... at least, not yet. Dustin might, or we can look it
> up. I don't know much about the whole im-log02/03 metrics/stats system.

I don't either.  This was mentioned 8 months ago in bug 658066 comment 26, but it doesn't look like anyone followed up on it.  Zeus says:

This virtual server is writing request logs to the following file:
  /var/log/zeus/aus3.mozilla.org.access_%{%Y-%m-%d-%H}t

if that's the "look it up" you're referring to.
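
For whoever sets up log collection, this Python sketch shows what file names that pattern would produce. It assumes the `%{...}t` token takes a strftime-style format, as in Apache-style log formats, and handles only that one token; the helper name is hypothetical, and which clock or timezone Zeus uses when writing the name is not confirmed here.

```python
import re
from datetime import datetime

# Matches a %{FORMAT}t token and captures FORMAT.
_TIME_TOKEN = re.compile(r"%\{([^}]*)\}t")


def expand_log_path(pattern: str, when: datetime) -> str:
    """Replace each %{FORMAT}t token with `when` rendered via strftime(FORMAT)."""
    return _TIME_TOKEN.sub(lambda m: when.strftime(m.group(1)), pattern)


PATTERN = "/var/log/zeus/aus3.mozilla.org.access_%{%Y-%m-%d-%H}t"

if __name__ == "__main__":
    # Placeholder timestamp, for illustration only.
    print(expand_log_path(PATTERN, datetime(2012, 4, 12, 17, 0)))
```

Under that assumption the pattern yields one file per hour, so a collector would look for names ending in a `YYYY-MM-DD-HH` suffix.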
Belatedly, I have realized that we will have to hard cut over to phx1 because our staging server (dm-ausstage01) is going away this week. Without it we can't publish content to the sjc1 webheads. Sorry for the brainfade turning this into a rush.
Severity: normal → major
Rollout proposal:
* Thursday day 
 * metrics & IT check log processing is set up
 * IT prep DNS change, I think we want this
     aus2.m.o to point at aus01.zlb.phx.mozilla.net (currently in Cedexis with sjc1)
     aus3.m.o to point at aus03.zlb.phx.mozilla.net (currently in our DNS)
* Thursday 1700 PDT/Friday 0000 UTC - change aus2.m.o and aus3.m.o in DNS
  * nthomas to verify aus3.m.o requests OK
  * metrics to confirm logs processing working
* Thursday evening PDT 
  * nthomas lands changes to buildbot and verifies
  * final cron job setup
* Friday - nthomas confirms dm-ausstage01 ready for power off (what time do you need this?)

There's a Firefox beta shipping on Friday IIRC, so we should be clear of that.
I'm good to handle this change this evening.
(In reply to Nick Thomas [:nthomas] from comment #38)
> * Friday - nthomas confirms dm-ausstage01 ready for power off (what time do
> you need this?)

The drop-dead time is about 9am Pacific on Monday.  So, end of the business day on Friday would be great, but if you want a just-in-case it can wait a bit into the weekend.
Per irc, DNS updated and confirmed through internal, 8.8.8.8, and 4.2.2.1

bburton@andesite ~/code/mozilla/sysadmins/dnsconfig$ svn ci -m "aus2/3 changes for bug 683446"
Sending        dnsconfig/external/mozilla.org-soa
Sending        dnsconfig/external/mozilla.org-update
Transmitting file data ..
Committed revision 33931.
The change looks good from here - I requested a bunch of updates for Firefox en-US across many versions and got the same content back before and after.
Let's call this done and deal with any metrics issues elsewhere.
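
A before/after comparison of the kind described above could be sketched in Python roughly as follows. The URL layout is copied from the test URL earlier in this bug; the function names, the default parameter values, and the idea of passing in a `fetch` callable (so the sketch runs without network access) are assumptions for illustration, not the actual test rig.

```python
from typing import Callable, Dict, Iterable, List


def update_url(host: str, version: str, build_id: str,
               build_target: str = "WINNT_x86-msvc",
               locale: str = "en-US", channel: str = "release") -> str:
    """Build an update.xml URL in the layout of the test URL in this bug."""
    return (f"https://{host}/update/1/Firefox/{version}/{build_id}/"
            f"{build_target}/{locale}/{channel}/update.xml")


def changed_urls(urls: Iterable[str], before: Dict[str, str],
                 fetch: Callable[[str], str]) -> List[str]:
    """Return the URLs whose current body differs from the saved `before` body."""
    return [u for u in urls if fetch(u) != before.get(u)]


if __name__ == "__main__":
    url = update_url("aus3.mozilla.org", "10.0.1", "20120208060813")
    saved = {url: "<updates></updates>"}  # hypothetical snapshot taken before the change
    # Stand-in for a real HTTP fetch; here it just returns the saved body.
    print(changed_urls([url], saved, lambda u: saved[u]))
```

An empty result means every sampled URL returned the same body before and after the DNS change.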
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Things look fine from Metrics perspective. pp-zlb10 is serving the traffic, but we are also monitoring 08-12 for passive logs.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard