Move to new Bouncer code and infra



6 years ago
5 years ago


(Reporter: laura, Assigned: nmaul)





6 years ago
Steps are as follows:

1. Deploy database schema changes to existing production database.  Changes are as follows:

-- add fallback region options (bug 613620)
ALTER TABLE `geoip_regions` ADD COLUMN `fallback_id` integer;
ALTER TABLE `geoip_regions` ADD CONSTRAINT `fallback_id_refs_id_e6bfe66d` FOREIGN KEY (`fallback_id`) REFERENCES `geoip_regions` (`id`);
CREATE INDEX `geoip_regions_e28329c2` ON `geoip_regions` (`fallback_id`);
ALTER TABLE geoip_regions ADD COLUMN prevent_global_fallback int(1) NULL;

-- Add SSL only support (bug 796088)
ALTER TABLE mirror_products ADD COLUMN `ssl_only` tinyint(1) NOT NULL DEFAULT 0;

2. WebQA to test (non-destructively) on the new cluster, since it is using the ACTUAL PROD DATABASE. You may add test products and test-only, SSL-only mirrors, as this should be safe. Do not edit any existing products or mirrors. Be sure to delete your test data (and ONLY your test data) when you are done.

3. When WebQA signs off, switch from old cluster to new cluster, via Zeus.

4. After a period of testing and watching logs via deinspanjer (at least several hours), we will sign off on the new cluster. (Daniel, how long do we expect this to take? Is 3pm PT too early?)

5.  Releng to add new SSL-only products (stubinstaller) to Bouncer.

Comment 1

The plan looks good to me; this is the same testing we've done on staging, and it helps guarantee a sane level of coverage for both positive and negative tests.

Comment 2

Access logs from the Zeus load balancers that sit in front of Bouncer are rolled over hourly and transferred to the log file server (metrics-logger1) via an rsync job. This transfer typically lags by about 4 hours; sometimes it is as little as 2 hours, and very rarely is it more than 5.

This means that requests that happen during the 9am hour (Pacific) can be viewed by Metrics between 11am and 2pm, with the most typical time being 12pm.

We would like to have at least 2 hours of log data to be able to trend and compare with the pre-cutover hours as well as with the same hours from the previous day.

So, take your cut-over time, add 2 to 3 hours for trending, and 2 to 3 hours for log collection, and you will have the earliest possible time we could give you results on the impact to incoming requests.
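The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the example cut-over time are hypothetical, and the 2-to-3-hour ranges are taken directly from the paragraph above.

```python
from datetime import datetime, timedelta


def earliest_results_window(cutover, trend_hours=(2, 3), collection_hours=(2, 3)):
    """Return (earliest, latest) times at which log-based results could be ready.

    Each range is (min_hours, max_hours); the defaults mirror the estimates
    given in this comment and are not measured values.
    """
    earliest = cutover + timedelta(hours=trend_hours[0] + collection_hours[0])
    latest = cutover + timedelta(hours=trend_hours[1] + collection_hours[1])
    return earliest, latest


# Hypothetical cut-over at 9am Pacific; the date itself is arbitrary.
cutover = datetime(2000, 1, 1, 9, 0)
earliest, latest = earliest_results_window(cutover)
print(earliest.strftime("%H:%M"), latest.strftime("%H:%M"))  # prints "13:00 15:00"
```

Under these assumptions, a 9am cut-over would yield results no sooner than 1pm, and more likely closer to 3pm.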

Finally, please keep in mind that these access logs show only redirect events.  They don't show where the redirect went or whether the download was completed.  If there is a systemic problem with the new Bouncer code, it is most likely to be in the second half that Metrics doesn't see.  Either the requests get redirected to the wrong place, or the destination doesn't deliver the client a working installer.  You need to coordinate with the CDNs or Ops to try to get verification of that part of the pipeline.

Comment 3

6 years ago
Jake says, re CDN:
We will have partial data immediately; estimate of bandwidth within 5-20 minutes, guaranteed numbers in ~5 hours, complete analytics in ~5 days.

We will watch the CDN numbers immediately.
We will have enough information, on both fronts, to sign off positively, 5 hours after ship.

Comment 4

6 years ago
Schema changes (comment 0, step 1) are complete. They were tested and do not affect current prod; on visual inspection they are just new columns and indexes, which the current code will ignore.
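
For anyone wanting to reproduce that kind of check locally, here is a rough sketch of the idea, not the test that was actually run. It uses an in-memory SQLite database as a stand-in for the production MySQL database, so the DDL is adapted to SQLite syntax (the foreign key is declared inline on the new column). The `name` columns and the "old code" queries are hypothetical; only the table names and the added columns come from comment 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical pre-migration tables; only the table names come from the bug.
cur.execute("CREATE TABLE geoip_regions (id integer PRIMARY KEY, name text)")
cur.execute("CREATE TABLE mirror_products (id integer PRIMARY KEY, name text)")

# Stand-ins for statements the old code might issue (explicit column lists).
OLD_INSERT = "INSERT INTO mirror_products (name) VALUES (?)"
OLD_SELECT = "SELECT id, name FROM mirror_products ORDER BY id"

cur.execute(OLD_INSERT, ("before-migration",))
before = cur.execute(OLD_SELECT).fetchall()

# SQLite-flavoured equivalents of the additive changes from comment 0.
cur.execute(
    "ALTER TABLE geoip_regions ADD COLUMN fallback_id integer "
    "REFERENCES geoip_regions (id)"
)
cur.execute("CREATE INDEX geoip_regions_e28329c2 ON geoip_regions (fallback_id)")
cur.execute("ALTER TABLE geoip_regions ADD COLUMN prevent_global_fallback int(1)")
cur.execute("ALTER TABLE mirror_products ADD COLUMN ssl_only tinyint(1) NOT NULL DEFAULT 0")

# The old statements still work, and existing rows pick up the default.
cur.execute(OLD_INSERT, ("after-migration",))
after = cur.execute(OLD_SELECT).fetchall()
ssl_flags = [row[0] for row in cur.execute("SELECT ssl_only FROM mirror_products ORDER BY id")]

print(before)
print(after)
print(ssl_flags)
```

Queries that name their columns explicitly are unaffected by added columns; code that relies on `SELECT *` or positional inserts would need a closer look.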

Comment 5

6 years ago
Cut-over details:

change the zeus pool in use for http/https vservers to new prod pools

change dns for to be a CNAME to

get a cert for, add to zeus as a TLS SNI cert on that vhost
    ultimately should become a real, purchased cert
    right now, can be the cert from the sentry node... signed by Mozilla Root CA
    can be done anytime

remove "new bouncer" temp VIP of from DNS and Zeus

Comment 6

Web QA has looked at this on staging (yesterday/today) and at the Bouncer Admin pieces on production, and has tested the changes in aliasing and fallback functionality; we're ready to ship!

Let's do this thing.

Comment 7

6 years ago
This is all completed.
Last Resolved: 6 years ago
Resolution: --- → FIXED
Verified FIXED; I let the smoke (dust, really) settle completely before verifying this.
Component: Server Operations: Web Operations → WebOps: Other
Product: → Infrastructure & Operations