Closed Bug 648146 Opened 9 years ago Closed 9 years ago

Hook up StAMN to Edgecast CDN

Categories

(mozilla.org Graveyard :: Server Operations, task, P2)

All
Other

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: clouserw, Assigned: nmaul)

Attachments

(1 file)

Right now static.addons.mozilla.net is hosted on Mozilla servers.  We'd like to begin using Edgecast to host this content.  There shouldn't be any change on the AMO side unless we need to change the actual domain name, and then it's just a change in settings.

From an implementation point of view, we should make this change on preview.amo first and get QA sign off, and then schedule the actual change in one of our weekly pushes.
I'm happy to edit my HOSTS file to start testing -- what IP do I want to map to static.addons.mozilla.net?
Assignee: server-ops → nmaul
I am not entirely sure what all is needed to set this up, but I have created this Edgecast property:

http://wac.1237.edgecastcdn.net/801237/static.addons.mozilla.net

As its origin, it uses:

http://static.addons.mozilla.net:80


This URL is currently returning a 403 Forbidden when accessed directly, but I'm fairly sure that's because it's 404'ing on the index page and then 403'ing on the 404 ErrorDoc page. I'm getting a 404 at the Edgecast URL above... there might be a propagation delay, or it could just be returning the first error and not following the ErrorDoc.

Do we have a sample URL that returns valid content we could test with?
Status: NEW → ASSIGNED
Passed along the setup email from Edgecast.
Unless it is static content, it will be a 403.  Static content is stuff like JS, CSS, images, etc.  You can see the apache rule for details there.  An example: https://static.addons.mozilla.net/en-US/firefox/pages/js_constants.js
Note: I have also set up a DNS CNAME for this, in DNS and in Edgecast's system:

http://static-cdn.addons.mozilla.net:80/

This is still propagating, but should work shortly.
(In reply to comment #5)
> Note: I have also set up a DNS CNAME for this, in DNS and in Edgecast's system:
> 
> http://static-cdn.addons.mozilla.net:80/
> 
> This is still propagating, but should work shortly.

This needs to be reverted: it was set up as a normal site, not an ADN site.
Oh, I missed that this was using a non-production URL.  In any event, it's not set up as an ADN in EdgeCast and still needs an SSL cert.
I have reverted the static-cdn.addons.mozilla.net CNAME record, and removed all of the normal CDN setup config on Edgecast's side. Effectively this is back to before I did anything.

As for the ADN setup, I need some additional information.

The way this looks to be configured (from mrz's setup email, from Edgecast), static.addons.mozilla.net actually lives on Edgecast's servers, not ours. That is, s.a.m.n is the ADN name, not the origin. Do we know what the Origin FQDN is supposed to be for the static.addons.mozilla.net property?
One other thing.. whatever the origin is, I need to have this 5k "sample asset", or a similarly-sized static content image/item uploaded to it. Once it (or a similar thing) is uploaded, please let me know the URL where I can access it.

As far as I can tell, any ~5k image should work.
I'm pretty sure "addons.mozilla.org" works as an origin server for this.  If you wish to use a different hostname you just need to tell EdgeCast to provision for that instead.
(In reply to comment #12)
> I'm pretty sure "addons.mozilla.org" works as an origin server for this.  If
> you wish to use a different hostname you just need to tell EdgeCast to
> provision for that instead.

The Apache rules for addons.mozilla.org and static.addons.mozilla.net are different (as mentioned in comment 4).  If it's mirroring origin, comment 12 is wrong.  If not, never mind. :)
A given FQDN can either be a CDN property or an origin... DNS can't point to both in any sane/maintainable manner. So if static.addons.mozilla.net is a CDN name, then it cannot also be the origin, and vice versa. Make sense?

I see that the code on a.m.o is already using static.addons.mozilla.net for static content. With that in mind, the easiest way to start using a CDN is to create static-cdn.addons.mozilla.net as a CDN property, with static.addons.mozilla.net as the origin.

Once this is done, the code for a.m.o. will need to change so that any links given out in the HTML are "static-cdn" instead of just static. I'm referring to 
"<img src=https://static.addons.mozilla.net/..." tags and the like.



We can also use static.addons.mozilla.net as the CDN name, but then we'd have to deal with renaming it for use as the origin. That seems more complicated, and I'm not sure what the benefit would be.
Adjusting those links on AMO just means changing the STATIC_URL variable in the settings file, so that's all on IT's end.
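As a rough illustration of why this is only a settings change, here is a minimal sketch of how asset links could be derived from a single STATIC_URL value. The two hostnames come from this bug; the helper name `static_asset_url` and the way the setting is consumed are assumptions, not the actual zamboni code.

```python
# Hostnames taken from this bug. Everything else here is a hypothetical sketch.
ORIGIN_STATIC_URL = "https://static.addons.mozilla.net"
CDN_STATIC_URL = "https://static-cdn.addons.mozilla.net"

# Switching to the CDN means changing only this one value.
STATIC_URL = CDN_STATIC_URL


def static_asset_url(path, static_url=STATIC_URL):
    """Join a site-relative asset path onto the configured static base URL."""
    return static_url.rstrip("/") + "/" + path.lstrip("/")


print(static_asset_url("/media/img/zamboni/global/bg-header.png"))
```

Because every link is built from the one setting, reverting is the same one-line change in the other direction.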
I have created the property "static-cdn.addons.mozilla.net", with the origin of "static.addons.mozilla.net". It will take up to an hour for propagation.

Once that's done, you can immediately change a.m.o. code to point to static-cdn.addons.mozilla.net instead. I've already updated DNS, that should be propagated by the time the CDN is ready to go.


For the aforementioned validation item, I used this:

https://static.addons.mozilla.net/media/img/zamboni/global/bg-header.png

This is okay for testing, but for production we'll want to make sure to use something that we can guarantee will not go away or get moved (and thus mysteriously break the CDN).
(In reply to comment #15)
> Adjusting those links on AMO just means changing the STATIC_URL variable in the
> settings file, so that's all on IT's end.

Excellent... when we're ready to test, I'll work with QA / Jeremy to get that done for one of the stage sites.
Cool.  Note that the stage sites (like addons.allizom.org) are using addons-cdn.allizom.org for their static content.  We should test staging with their domains so the code lines up with the static content.
Origin:
https://static.addons.mozilla.net/media/img/zamboni/global/bg-header.png
Works (obviously).

CDN CNAME:
https://static-cdn.addons.mozilla.net/media/img/zamboni/global/bg-header.png
Works, but bad cert- Edgecast is presenting a cert for their own FQDN.

CDN w/ Edgecast's name:
https://gs1.adn.edgecastcdn.net/801237/static-cdn.addons.mozilla.net/media/img/zamboni/global/bg-header.png
This works properly.


I will work with Edgecast to see about using a valid cert for static-cdn.addons.mozilla.net... that would be ideal. We have a working solution without that, just by using the domain name they provide.
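For anyone repeating this check with another file, the three URL forms above can be generated from one path. This is only a sketch: the Edgecast host and account number are copied from the URLs in this comment, and the function name `candidate_urls` is made up for illustration.

```python
# Values copied from the URLs in this bug; the function name is hypothetical.
EDGECAST_ADN_HOST = "gs1.adn.edgecastcdn.net"
EDGECAST_ACCOUNT = "801237"


def candidate_urls(origin_host, cdn_cname, path):
    """Return the three URLs for one asset: origin, CDN CNAME, Edgecast-named."""
    path = "/" + path.lstrip("/")
    return {
        "origin": f"https://{origin_host}{path}",
        "cdn_cname": f"https://{cdn_cname}{path}",
        "edgecast": f"https://{EDGECAST_ADN_HOST}/{EDGECAST_ACCOUNT}/{cdn_cname}{path}",
    }


urls = candidate_urls(
    "static.addons.mozilla.net",
    "static-cdn.addons.mozilla.net",
    "/media/img/zamboni/global/bg-header.png",
)
for label, url in urls.items():
    print(label, url)
```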
Priority: -- → P2
Blocks: 635610
Existing Edgecast ticket number is 26258. Craig Kaplan has updated it with the new name 'static-cdn.addons.mozilla.net' as the CDN property name. Waiting on them now.
(In reply to comment #18)
> Cool.  Note that the stage sites (like addons.allizom.org) are using
> addons-cdn.allizom.org for their static content.  We should test staging with
> their domains so the code lines up with the static content.

Do you mean, set up a separate staging ADN property, and point addons.allizom.org at it using the Edgecast FQDN? That sounds like a good idea to me.

We can nix the current 'addons-cdn.allizom.org' DNS record (it's just a CNAME to addons.allizom.org anyway), and set the STATIC_URL variable for stage over to the appropriate Edgecast URL... like the one above in comment 19, but for the stage property instead.

Sound good? This way you guys can at least get some QA work in while the main prod domain gets configured.
Sounds fine to me.  I just want to keep stage separate from prod, however you want to do it.
Separate staging property set up, but with no custom CNAME for it, so we can just use the direct Edgecast FQDN naming scheme instead. Should be just right for a staging environment.

Should work in 1 hour or less:
https://gs1.adn.edgecastcdn.net/801237/addons-cdn.allizom.org/media/img/zamboni/global/bg-header.png


Let us know when you would want to make this change, and for which site (preview.a.m.o, presumably?).


CC'ing Jeremy on this for his assistance...

Jeremy, I believe we would want to change the STATIC_URL line in:
mradm02:/data/amo_python/www/preview/zamboni/settings_local.py
to:
STATIC_URL = 'https://gs1.adn.edgecastcdn.net/801237/addons-cdn.allizom.org'
And then push it out. Could you double check me on this, and potentially guide me through it in IRC when they're ready?
(In reply to comment #23)
> Should work in 1 hour or less:
> https://gs1.adn.edgecastcdn.net/801237/addons-cdn.allizom.org/media/img/zamboni/global/bg-header.png

Something somewhere is redirecting this to: 

https://addons-cdn.allizom.org/media/img/zamboni/global/bg-header.png

I don't think this is anything in Edgecast, because the same behavior is not present for the static.a.m.n property. There appears to be a RewriteRule in the VirtualHost for that site, that redirects anything not using the HTTP_HOST header addons-cdn.allizom.org over to that. This would need to be updated or removed as well.
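To make the suspected behavior concrete, here is a small sketch of a canonical-host redirect of the kind described above. It is not the actual Apache RewriteRule (which is not shown in this bug), and which Host header Edgecast sends to the origin is also not stated here; the function name is hypothetical.

```python
CANONICAL_HOST = "addons-cdn.allizom.org"  # hostname from this bug


def redirect_target(host_header, path, canonical_host=CANONICAL_HOST):
    """Return the redirect URL for a non-canonical Host header, else None."""
    # Drop any :port suffix and compare case-insensitively.
    host = host_header.split(":")[0].lower()
    if host == canonical_host:
        return None  # request is served normally
    return f"https://{canonical_host}{path}"


# A request arriving with any other Host value gets bounced to the canonical name.
print(redirect_target("gs1.adn.edgecastcdn.net", "/media/img/zamboni/global/bg-header.png"))
```

If the origin answers the CDN's fetch with such a redirect, the CDN simply relays it, which would match what is seen at the Edgecast URL.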
Stage doesn't need to use CDN right?  Doesn't warrant the cost imo.
We were intending to use the stage environment to test the CDN. The only way to do that is to point a CDN property at the stage origin.

I do know that this one does *not* involve a separate certificate setup or anything complicated- in fact Edgecast support isn't even involved in it. It's much simpler than the prod setup is. This might influence the cost (at the very least, no SSL cert cost).

It really depends how thoroughly the stage environment should duplicate prod, and how much we're willing to spend to make that happen. I don't have the costs, so I'm not in a position to make that call. I would typically expect the CDN pricing to be based on bandwidth or "number of CDN nodes used", as opposed to "number of properties configured".

Without knowing the costs, I think we should forge ahead with a CDN setup for stage, at least long enough to know it's going to work the way we expect it to. Once we know it works, we can revert stage back to a non-CDN deployment and kill all the CDN configs for it.
(Personas staging uses the EdgeCast CDN, FWIW: http://personas.stage.mozilla.com/en-US/.  I haven't really found issues in the past when testing CDNs for static content, but I'd love to see us come as close as possible, if possible.)
This is completed! I have verified with Edgecast that the following CDN URL is functional and returns a valid cert. It is somewhat strange to look at because the Common Name (CN) field is still an Edgecast domain name, but static-cdn.addons.mozilla.net is listed elsewhere in the cert, and neither Firefox 4 nor Safari complain about it.

https://static-cdn.addons.mozilla.net/media/img/zamboni/global/bg-header.png
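A sketch of why browsers accept a cert whose CN is an Edgecast name: hostname checks look at the DNS entries of the subjectAltName extension first and only fall back to the CN when there are none. This assumes "listed elsewhere in the cert" means subjectAltName, which the bug does not state outright. The dict shape follows what Python's `ssl.SSLSocket.getpeercert()` returns; the sample cert contents and function names are illustrative, not the real certificate.

```python
def _matches(pattern, hostname):
    """Exact match, or a '*.' wildcard covering exactly one left-most label."""
    if pattern.startswith("*."):
        suffix = pattern[1:]  # e.g. ".example.com"
        if not hostname.endswith(suffix):
            return False
        label = hostname[: -len(suffix)]
        return label != "" and "." not in label
    return pattern == hostname


def cert_covers_hostname(cert, hostname):
    """Check a getpeercert()-style dict; DNS SAN entries take priority over CN."""
    hostname = hostname.lower()
    san_names = [v.lower() for kind, v in cert.get("subjectAltName", ()) if kind == "DNS"]
    if san_names:
        return any(_matches(name, hostname) for name in san_names)
    # Fall back to the Common Name only when the cert has no DNS SAN entries.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName" and _matches(value.lower(), hostname):
                return True
    return False


# Illustrative cert: Edgecast name in the CN, our name among the SAN entries.
sample_cert = {
    "subject": ((("commonName", "gs1.adn.edgecastcdn.net"),),),
    "subjectAltName": (
        ("DNS", "gs1.adn.edgecastcdn.net"),
        ("DNS", "static-cdn.addons.mozilla.net"),
    ),
}
print(cert_covers_hostname(sample_cert, "static-cdn.addons.mozilla.net"))
```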


PLEASE NOTE: we are still using that particular file as the 'validation item'. This means the CDN will potentially start having problems if that file is ever moved, renamed or deleted!


As for staging, DNS is currently set up to point 'addons-cdn.allizom.org' straight back to the main staging site (addons.allizom.org). I can fairly easily change this to point to Edgecast... just let me know if/when you want to do this. This should be only a DNS change if STATIC_URL is already set up to point to that record.


You can use static-cdn.addons.mozilla.net (prod) right now, if you like. I understand it would just be a change of the STATIC_URL variable... just let us know when you'd like to do that.
(In reply to comment #28)
> This is completed! I have verified with Edgecast that the following CDN URL is
> functional and returns a valid cert. It is somewhat strange to look at because
> the Common Name (CN) field is still an Edgecast domain name, but
> static-cdn.addons.mozilla.net is listed elsewhere in the cert, and neither
> Firefox 4 nor Safari complain about it.
> 
> https://static-cdn.addons.mozilla.net/media/img/zamboni/global/bg-header.png
> 
> 
> PLEASE NOTE: we are still using that particular file as the 'validation item'.
> This means the CDN will potentially start having problems if that file is ever
> moved, renamed or deleted!
> 
> 
> As for staging, DNS is current set up to point 'addons-cdn.allizom.org'
> straight back to the main staging site (addons.allizom.org). I can fairly
> easily change this to point to Edgecast... just let me know if/when you want to
> do this. This should be only a DNS change if STATIC_URL is already set up to
> point to that record.
> 
> 
> You can use static-cdn.addons.mozilla.net (prod) right now, if you like. I
> understand it would just be a change of the STATIC_URL variable... just let us
> know when you'd like to do that.

Jake, if we encounter problems in staging, this is easy to flip back, right (e.g. just reverting the change to STATIC_URL)?

Krupa, if we work together on this (with me taking the brunt of it), are you willing to give it a whirl, now?

AFAIK, we haven't had to update our current Selenium tests that take image URLs into consideration (from the change to addons.allizom.org -> addons-cdn.allizom.org), so that, at least, shouldn't be an issue.
Staging can be either a DNS change or a STATIC_URL change, depending on if we want to use our own DNS name or Edgecast's, but yes it's easily revertible either way. I'm out for the day, but we can hit this up tomorrow morning if you like.
Did this get staged this morning? We need this rolled out at the latest by May 2nd...is that possible?
(In reply to comment #31)
> Did this get staged this morning? We need this rolled out at the latest by May
> 2nd...is that possible?

WebQA's ready -- Jake, fire away any time on preview; thanks!
DNS changed for addons-cdn.allizom.org, waiting for propagation- should be 15 minutes or so.

Old record:
addons-cdn.allizom.org.	600	IN	CNAME	addons.allizom.org.

New:
addons-cdn.allizom.org.	600	IN	CNAME	gs1.adn.edgecastcdn.net.


As mentioned above, this will work, but will result in an invalid certificate warning because Edgecast is presenting their own cert for this property. Prod will present a valid cert, however.
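For reference, a small sketch that parses the two records above and reports what changed. The record text is copied from this comment; the parser is a simplification that only handles this five-field layout, and the function name is hypothetical.

```python
def parse_record(line):
    """Parse a 'name ttl class type target' resource record line."""
    name, ttl, rclass, rtype, target = line.split()
    return {"name": name, "ttl": int(ttl), "class": rclass, "type": rtype, "target": target}


old = parse_record("addons-cdn.allizom.org.\t600\tIN\tCNAME\taddons.allizom.org.")
new = parse_record("addons-cdn.allizom.org.\t600\tIN\tCNAME\tgs1.adn.edgecastcdn.net.")

# Only the CNAME target differs between the two records.
changed = {key: (old[key], new[key]) for key in old if old[key] != new[key]}
print(changed)
print("TTL in minutes:", old["ttl"] / 60)
```

The 600-second TTL means resolvers holding the old answer should refresh within about 10 minutes, which is in line with the propagation estimate given here.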
(In reply to comment #33)
> DNS changed for addons-cdn.allizom.org, waiting for propagation- should be 15
> minutes or so.
> 
> As mentioned above, this will work, but will result in an invalid certificate
> warning because Egecast is presenting their own cert for this property. Prod
> will present a valid cert, however.

Reverted, the cert problem breaks QA testing. Stephen is looking into an exception, but beyond that the only way to make stage work is to get a real cert for it on Edgecast's side, like prod has. Do we care that much? I have no idea what the cost is.

We cannot use the Edgecast domain name either (see comment 24): something on our side redirects it to addons-cdn.allizom.org, which, because of the certificate issue, is once again just a CNAME for addons.allizom.org.
We have moved addons-cdn.allizom.org back to Edgecast, and added an exception for the cert. Additionally, I had to fix a couple config issues with the stage property that were causing 404 errors... prod is already set up properly.

Please let me know if/when we want to cut over the STATIC_URL field for prod.
This is Krupa's call, and she can comment on timing.  (I've tested initially, but nothing much on AMO besides clicking around, until now -- I'll do some testing tonight.)

We're having our testers run a few test suites (manual), that cover more dynamic things like changing user-profile images, add-on and collection icons on initial upload, etc.  I can say that all of our Selenium automation for AMO is passing, with no failures related to static content since the cutover.
Component: Server Operations: Web Content Push → Server Operations
Whiteboard: [waiting on QA approval]
(In reply to comment #31)
> Did this get staged this morning? We need this rolled out at the latest by May
> 2nd...is that possible?

I pinged you in #release-drivers; Christian, what's special about May 2nd?  Tied to a particular release?
We need this done before we do the advertised major update to all 3.5/3.6 users so the get addons page in FF4 isn't slow for Asian locales (bug 635610).

We were originally planning to ship on the 3rd (which is why it needed to be ready by the 2nd) but it has now been pushed back to the 5th (so we should have this done by the 4th)...see https://wiki.mozilla.org/Releases/ for specific release dates.
This shouldn't be tied to a release - it's simply a demo/trial of EdgeCast's ADN/SSL platform.  We haven't committed to it long term.

We may but I want to set expectations.
The 10 sec loading for Asian users has been deemed a potential blocker for mass FF4 rollout, and this bug is supposed to help. Let's get it in place before the advertised update and see if it helps in any case.
FYI, it's > 20 seconds from Buenos Aires.  Enough that if I didn't know better I'd think the addon pane wasn't working.

jake, eta to go live?
Just had to throw out that you guys are in Argentina, eh? ;-) well, I will confirm the 10s from Indonesia once I land in 24 hours :-D

But yeah, feels weird to block on this but to a user it looks broken and is total clown shoes for an Internet project/company.
ETA to go live is just a config change and push... just need to change the STATIC_URL variable in settings_local.py and push it out. There's already an AMO push on Thursday, it could go out with that.

Stephen / Krupa, are we good-to-go on this? This Thursday is the last already-scheduled time to push this before your deadline... If possible I'd prefer this go out in a normal push like anything else, rather than a one-off exception.
Duplicate of this bug: 512570
We currently have two blocking bugs which are open. If they are resolved, we are good to go.

The blocking bugs are bug 652796 and bug 652853
Depends on: 652796, 652853
Whiteboard: [waiting on QA approval]
Depends on: 653042
This no longer has any blocker bugs. Are we clear to change STATIC_URL during today's 6.0.7 launch/push?
Not yet.

Krupa thinks bug 653475 may be caused by this.
Depends on: 653475
This didn't go out today.  We'll have to work on that bug separately.  If this needs to go out before next Thursday, we should coordinate another push for monday or tuesday.  Does it?
Are we doing this tomorrow? Push plan?
QA: can you give us another sign off on PAMO?  I'd suggest for a push plan:

1) QA signs off on PAMO
2) We tag and update AMO (no config changes)
3) QA adjusts their DNS to use the new CDN and signs off on AMO
4) We do the config change to point to the new CDN
(In reply to comment #51)
> QA: can you give us another sign off on PAMO? 

Yep. Everything is looking good on PAMO.
We have written a Zeus TrafficScript that will intercept responses to my IP and replace 'static.addons.mozilla.net' with 'static-cdn.addons.mozilla.net', for testing this out on prod.

This has revealed that there is a different cert error that would affect production- sec_error_unknown_issuer is the FF error code, affecting at least FF4 on Windows 7. I have emailed Edgecast about this, and am awaiting a response.
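The idea behind the rule, sketched in Python rather than TrafficScript (the real rule is not reproduced in this bug): rewrite the static hostname in response bodies, but only for an allowlist of tester IPs. The IP below is a placeholder from the documentation range, and the function name is hypothetical.

```python
# Placeholder tester address (TEST-NET-1 documentation range), not a real client.
TEST_CLIENT_IPS = {"192.0.2.10"}

OLD_HOST = "static.addons.mozilla.net"
NEW_HOST = "static-cdn.addons.mozilla.net"


def rewrite_response_body(client_ip, body, test_ips=TEST_CLIENT_IPS):
    """Point listed testers at the CDN hostname; leave everyone else untouched."""
    if client_ip not in test_ips:
        return body
    # OLD_HOST is not a substring of NEW_HOST, so a second pass changes nothing.
    return body.replace(OLD_HOST, NEW_HOST)


html = '<img src="https://static.addons.mozilla.net/media/img/zamboni/global/bg-header.png">'
print(rewrite_response_body("192.0.2.10", html))
```

This lets a handful of people exercise the prod CDN hostname without changing STATIC_URL for everyone.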
So this push didn't happen due to comment 53?
Correct. They replied late last night, and are investigating this.
The symptoms in comment 53 could have been due to the site not serving intermediate certs (in which case it may or may not work depending on whether you've already browsed a site using the same intermediates that session). It works for me right now, for instance.

SSLLabs says I'm right
https://www.ssllabs.com/ssldb/analyze.html?d=static-cdn.addons.mozilla.net

NB: skip the clean bill of health at the top and look at the "Chain issues" line.

It should be trivial for the CDN to add the intermediates. Surely they've done that before (otherwise they're mostly serving broken SSL sites).
That was my assessment as well- incomplete certificate chain. Some browser/OS combos might know that Digicert CA cert directly, but apparently not all. I've poked them again.
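A toy model of the incomplete-chain problem, to show why results differ between browsers and sessions. It compares subject and issuer names only and does no signature checking; all certificate names except the site hostname are made up, and the function name is hypothetical.

```python
def chain_is_complete(served_chain, known_issuers):
    """served_chain: list of (subject, issuer) name pairs, leaf first.

    known_issuers: names the client already trusts or has cached.
    """
    if not served_chain:
        return False
    # Each cert must be issued by the next cert the server sent.
    for (_, issuer), (next_subject, _) in zip(served_chain, served_chain[1:]):
        if issuer != next_subject:
            return False
    # The last served cert must chain to something the client already knows.
    last_subject, last_issuer = served_chain[-1]
    return last_issuer in known_issuers or last_subject in known_issuers


roots = {"Example Root CA"}  # hypothetical trust store
leaf_only = [("static-cdn.addons.mozilla.net", "Example Intermediate CA")]
full_chain = leaf_only + [("Example Intermediate CA", "Example Root CA")]

print(chain_is_complete(leaf_only, roots))   # server omits the intermediate
print(chain_is_complete(full_chain, roots))  # server sends the intermediate
# A client that cached the intermediate from another site still succeeds.
print(chain_is_complete(leaf_only, roots | {"Example Intermediate CA"}))
```

The third case is the "works for me" situation: the leaf alone validates only when the client happens to know the intermediate already.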
Edgecast has confirmed that there is/was a problem with the intermediate cert on their end, and that it has been fixed. However, they are also saying it will be another 6 hours before this is live in production (there is testing on their end, plus propagation... or so I understood).
This should make it fixed before the Firefox 4.0.1 advertised update tomorrow then.
Update on my end:

* all our automation for AMO (sans 1 known, completely unrelated bug) was passing after we accepted the first cert (https://github.com/mozilla/moz-grid-config/commit/e6e863c39f67a2cd3660c47a0702c4fa45bb48a5#firefoxprofiles), 
* we got hit from the cert change, and all tests started failing, with "session issues" -- our Selenium browsers were getting the sec_error_unknown_issuer error-prompt -- modal, blocking dialog, hence browser "hangs" (http://qa-selenium.mv.mozilla.com:8080/view/AMO/job/amo.smoketests/190/console)
* so we landed (https://github.com/mozilla/moz-grid-config/commit/c4fba25a5b8b8dd7ef23b16f5799cd792e084535), the new cert's cert_override value that Selenium needs
* on the next run, the session failures went away (as expected), and a new, unrelated failure (http://qa-selenium.mv.mozilla.com:8080/job/amo.smoketests/191/console) from https://github.com/jbalogh/zamboni/commit/5dc1cfe79ff99fe7c93d01041eb61757137d737b#diff-11 made us fail (we just need to update our test, it's not a real failure)

Things are looking great to me: all dependent bugs have been fixed, with no new issues discovered so far.  

Krupa has the final say -- what's your take, Krupa?
Edgecast has confirmed that the intermediate cert issue is fixed, and it appears to be so in my testing as well. We're good to go. QA, if you want to test it in production, let me know what your IPs are and I'll make the trafficscript send you to 'static-cdn' instead of 'static'.
STATIC_URL change is made on mradm02, this will go out with today's 2pm push.
This is out.  How does it feel now?
(In reply to comment #64)
> This is out.  How does it feel now?

Looks good; literally the shortest build/run-times we've seen, now: http://qa-selenium.mv.mozilla.com:8080/job/amo.prod.smoketests/buildTimeTrend
Okay to close this out?
Works for me.  I actually meant comment 64 to be in bug 635610
Excellent. Thanks to all for the very nice improvement in load times. Argentina went from ~20 seconds to 6-7 seconds... and Edgecast doesn't even *have* caching nodes in South America- nearest are Miami and Dallas.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Calling this Verified FIXED -- we tested and verified it manually during the deploy, and our automation has been passing without issue, since, both on trunk and production, and I haven't heard of additional problems.
Status: RESOLVED → VERIFIED
Product: mozilla.org → mozilla.org Graveyard