Closed Bug 1331583 Opened 7 years ago Closed 6 years ago

Encourage Heroku apps using the SSL Endpoint addon to switch to the native SNI-based SSL

Categories: mozilla.org :: Heroku: Administration (task)
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: emorley, Assignee: emorley

Heroku's native SNI-based SSL has been out for a bit now:
https://blog.heroku.com/ssl-is-now-included-on-all-paid-dynos

I see 17 apps using it at the moment - getting those to switch could save $340/month of addons costs (which seems useful, given we're regularly over our addons credits: https://dashboard.heroku.com/orgs/mozillacorporation/usage)

Treeherder is already switching in bug 1316712.

Pros:
* Doesn't require the $20/month addon
* Copes with spikes in load without needing to pre-warm
* Heroku have said it's recommended over the old SSL endpoint addon, and that new features will only be available on the SNI based solution
* the SNI based solution wasn't affected during the Dyn DDOS, unlike the SSL addon

Cons:
* Requires SNI, so can't be used by people on Windows XP who only have Internet Explorer installed (https://en.wikipedia.org/wiki/Server_Name_Indication#Support)
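The savings estimate above is a single multiplication: 17 apps at $20/month each. Here is a minimal sketch of that arithmetic. It assumes a flat per-app price for the legacy addon, and the helper name is hypothetical:

```python
# Hypothetical helper: estimate the monthly saving from removing the legacy
# SSL Endpoint addon, assuming every app pays the same flat price.
ADDON_COST_PER_MONTH_USD = 20  # price quoted in this bug; assumed flat


def monthly_savings(app_count: int, cost_per_app: int = ADDON_COST_PER_MONTH_USD) -> int:
    """Return the total monthly addon cost that switching would remove."""
    if app_count < 0:
        raise ValueError("app_count must be non-negative")
    return app_count * cost_per_app


print(monthly_savings(17))  # 340, matching the $340/month estimate
```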

Switching process:
https://devcenter.heroku.com/articles/ssl#migrate-from-ssl-endpoint-to-heroku-ssl

$ heroku addons --all | grep 'ssl:endpoint' | cut -d' ' -f1
bugherder
bugzfeed
bzlite
contribute-json
crates-io
mozilla-pontoon
oneanddone-dev
oneanddone-prod
oneanddone-stage
platatus
pulseguardian
releng-tc-coalesce
releng-tc-coalesce-stage
srihash
taskcluster-auth
taskcluster-cors-proxy
treeherder-prod
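The shell pipeline above keeps the lines that mention `ssl:endpoint` and prints the first space-delimited field of each. A Python equivalent might look like this. It is a sketch that assumes, as the `cut` call does, that the app name is the first field of each matching line of `heroku addons --all` output. The helper names are hypothetical, and `list_from_cli` needs the Heroku CLI to be installed and logged in:

```python
import subprocess
from typing import Iterable, List


def apps_with_ssl_endpoint(lines: Iterable[str]) -> List[str]:
    """Mimic `grep 'ssl:endpoint' | cut -d' ' -f1` over CLI output lines."""
    apps = []
    for line in lines:
        if "ssl:endpoint" in line:
            # cut -d' ' -f1 keeps everything before the first single space
            apps.append(line.rstrip("\n").split(" ", 1)[0])
    return apps


def list_from_cli() -> List[str]:
    """Run `heroku addons --all` and filter its output the same way."""
    output = subprocess.run(
        ["heroku", "addons", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    return apps_with_ssl_endpoint(output.splitlines())
```

For example, `apps_with_ssl_endpoint(["bugherder ssl:endpoint ...", "other-app heroku-postgresql ..."])` returns `["bugherder"]`. These sample lines are invented for illustration and are not real CLI output.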
s/I see 17 apps using it/I see 17 apps using the legacy addon/
Bug 1173726 comment 9 indicates that bzlite was moved off of Mozilla properties.

Bug 1310292 replaced dev/stage/prod of oneanddone with a simple redirect (not using Heroku).
(In reply to Ed Morley [:emorley] from comment #0)
> Cons:
> * Requires SNI, so can't be used by people on Windows XP who only have
> Internet Explorer installed
> (https://en.wikipedia.org/wiki/Server_Name_Indication#Support)

We can't any longer issue the SHA-1 certificates these people would *also* require to access these sites, were we somehow to circumvent the SNI requirement on their behalf. Python 2.6 API consumers are now the primary issue for migrating certificates to SNI, as we discovered with the Mozillians.org API endpoints.
Depends on: 1331791
(In reply to Richard Soderberg [:atoll] from comment #2)
> Bug 1173726 comment 9 indicates that bzlite was moved off of Mozilla
> properties.
> 
> Bug 1310292 replaced dev/stage/prod of oneanddone with a simple redirect
> (not using Heroku).

Good spot - I've filed bug 1331772 and bug 1331791 for deleting these, since we're still paying for both dynos and addons for them.

> We can't any longer issue the SHA-1 certificates these people would *also*
> require to access these sites, were we somehow to circumvent the SNI
> requirement on their behalf. 

Agreed, however the existing certificate/private key should be available to download from Mozilla's admin page on the CA, which can then be re-uploaded to the new Heroku SNI-SSL endpoint?

> Python 2.6 API consumers are now the primary
> issue for migrating certificates to SNI, as we discovered with the
> Mozillians.org API endpoints.

Agreed we'd need to be careful.

However Python <2.7.9 (including 2.6) can use SNI with requests, by using `requests[security]` in their requirements files (and the appropriate system packages):
http://docs.python-requests.org/en/master/community/faq/#what-are-hostname-doesn-t-match-errors
https://github.com/kennethreitz/requests/blob/580b45dd3caced54a1621ea0ffa88f20af9af4ea/setup.py#L98
https://stackoverflow.com/questions/18578439/using-requests-with-tls-doesnt-give-sni-support/18579484#18579484
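A quick way to tell whether a given interpreter can send SNI with only the standard library is to check the `ssl.HAS_SNI` flag. The attribute is missing on Python 2.6 and on 2.7 before 2.7.9, and it can also be false if the linked OpenSSL lacks SNI. Interpreters that fail this check would need the `requests[security]` route above. This is a minimal sketch with a hypothetical function name, written so that it also parses on old Python 2:

```python
import ssl
import sys


def native_sni_supported():
    """Return True if this interpreter's ssl module can send SNI itself."""
    # Older ssl modules lack HAS_SNI entirely, so default to False
    # rather than raising AttributeError.
    return bool(getattr(ssl, "HAS_SNI", False))


if __name__ == "__main__":
    print("Python %s: native SNI %s" % (
        sys.version.split()[0],
        "supported" if native_sni_supported() else "NOT supported",
    ))
```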
(In reply to Ed Morley [:emorley] from comment #4)
> Agreed, however the existing certificate/private key should be available to
> download from Mozilla's admin page on the CA, which can then be re-uploaded
> to the new Heroku SNI-SSL endpoint?

Yep, WebOps can help you retrieve those or at worst reissue them with a replacement private key, if needed.

> > Python 2.6 API consumers are now the primary
> > issue for migrating certificates to SNI, as we discovered with the
> > Mozillians.org API endpoints.
> 
> Agreed we'd need to be careful.
> 
> However Python <2.7.9 (including 2.6) can use SNI with requests, by

Yeah, we learned this after we reverted the Mozillians API to non-SNI, but haven't followed up with "we're going to make the API an SNI-only endpoint, adapt or perish". Doing so in this case with any thus-affected Heroku-hosted APIs does, at the very least, have some chance of minimizing outages or negative reactions after the cutover.
Depends on: 1331998
Now that the recent dependent bugs have been resolved, we're down to:

bugherder  --> this is fine to switch
bugzfeed
contribute-json
crates-io
mozilla-pontoon
platatus
pulseguardian
releng-tc-coalesce
releng-tc-coalesce-stage
srihash
taskcluster-auth
taskcluster-cors-proxy

Cameron, does any tooling interact with the pulseguardian Heroku app (that might not support SNI), or just browsers?
Flags: needinfo?(cdawson)
I seem to recall that last time we tried to "upgrade" the SSL settings for taskcluster, Buildbot proved unable to download for some other technical reason (we are already using SNI).  Greg, do you happen to remember context (or pointers to context) for that?
Flags: needinfo?(garndt)
I thought it had something to do with old SSL libraries, but the last time I spoke to Heroku, they seemed to indicate that our endpoint is current, and I have not heard people complain. Other than that, I do not remember much.
Flags: needinfo?(garndt)
Crikey, sorry for the slow response.  I don't know of any tooling that calls our APIs directly.  Just the UI.  We could probably move to the native SNI without a problem.
Flags: needinfo?(cdawson)
Depends on: 1336489
For the people needinfod, could you say whether your app is suitable for switching to the Heroku SNI based SSL? (Either because they only have browser consumers who aren't using IE with Windows XP, or we're happy that the tools accessing them can handle SNI). See comment 0 for more info.

:pmac -> https://dashboard.heroku.com/apps/contribute-json
:acrichto -> https://dashboard.heroku.com/apps/crates-io
:mattjazz -> https://dashboard.heroku.com/apps/mozilla-pontoon
:digitarald -> https://dashboard.heroku.com/apps/platatus
:francois -> https://dashboard.heroku.com/apps/srihash

(In reply to Cameron Dawson [:camd] from comment #9)
> Crikey, sorry for the slow response.  I don't know of any tooling that calls
> our APIs directly.  Just the UI.  We could probably move to the native SNI
> without a problem.

Great - I'll add pulseguardian to the list. (To reduce webops work with DNS changes etc, I'd like to have a list of a few that can be done at a similar time)
Flags: needinfo?(pmac)
Flags: needinfo?(m)
Flags: needinfo?(hkirschner)
Flags: needinfo?(francois)
Flags: needinfo?(acrichton)
Thanks for the ping! 

AFAIK for crates-io we *should* be ok to move over to requiring SNI. The main client we'd be worried about would be a program that we manage and typically should have new enough dependencies to work with SNI. The crates-io app though is hit by the wider Rust ecosystem, so we're definitely not the only users!

Would it be possible to test out such a change? Could we flip a switch to require SNI but reserve the ability to flip it back for the next few weeks to make sure the transition goes smoothly?

If we do make the transition, what would that look like? Would we need to change DNS records and require some down time? Would we use the same SSL certificates from before, or would we have to generate new ones?
Flags: needinfo?(acrichton)
Fortunately the switch is zero-downtime for SNI-supporting clients.

Roll-out steps:
1) For sites that have non-browser consumers or are external facing, make an announcement on the appropriate mailing lists, warning about the need for SNI support and explaining the timeline.
2) Upload SSL certs to the new Heroku SNI system (cannot extract the certs from the existing ELB-based solution, but can re-use certs from the Digicert dashboard or wherever)
3) Update the DNS so the CNAME points to the new endpoint
4) Wait for DNS to propagate (during the transition some requests will hit the old endpoint, some the new, but both will work)
5) Check for any issues

If a rollback is needed:
6) Update DNS to reset the CNAME target back to previous value
7) Wait for DNS to propagate

Or, if no rollback is required after X days/weeks:
6) Remove the legacy addon from the app, which will be receiving zero traffic but still being billed
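For steps 3 and 4, one way to watch the cutover from a particular machine is to check where the custom domain currently resolves. This is a minimal sketch. It assumes the system resolver reports the CNAME target as the canonical name (or among the aliases) returned by `socket.gethostbyname_ex`, which is typical but resolver-dependent. The `herokudns.com` suffix comes from the `ZZZ.herokudns.com` targets discussed in this bug, and the function name is hypothetical:

```python
import socket
from typing import Callable, List, Tuple

# Same shape as socket.gethostbyname_ex: (canonical_name, aliases, addresses)
Resolver = Callable[[str], Tuple[str, List[str], List[str]]]


def points_at_sni_endpoint(hostname: str,
                           new_suffix: str = ".herokudns.com",
                           resolve: Resolver = socket.gethostbyname_ex) -> bool:
    """Return True if `hostname` currently resolves via a name ending in `new_suffix`.

    The canonical name and any intermediate aliases are both checked, since
    where the CNAME target shows up depends on the resolver.
    """
    canonical, aliases, _addresses = resolve(hostname)
    names = [canonical] + list(aliases)
    return any(name.rstrip(".").endswith(new_suffix) for name in names)
```

During propagation, different resolvers can return different answers, so a `False` from one machine does not mean the DNS change failed.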

Known problematic clients:
* Windows XP users who are on IE (other browsers fine)
* Python <2.7.9 unless the necessary pyOpenSSL/cryptography packages are installed (eg using `requests[security]`)
(Plus the rest on https://en.wikipedia.org/wiki/Server_Name_Indication#Support)
Awesome, thanks for that list! That sounds great to me!

So it sounds like we can take most of these steps independently ourselves, right? If so, I'd propose:

1) We should upload certs to SNI system now
2) Test SNI works at all
3) We'll post on our forums to gather feedback
4) Assuming good feedback, switch DNS
5) Assuming good for a few weeks, delete addon

Does that sound right? If so, then two final questions:

* Are there docs for how to upload certificates to the SNI system?
* Somewhat unrelated, but does Mozilla have a general solution for SSL certs on Heroku through, for example, Let's Encrypt?
(In reply to Ed Morley [:emorley] from comment #10)
> For the people needinfod, could you say whether your app is suitable for
> switching to the Heroku SNI based SSL? (Either because they only have
> browser consumers who aren't using IE with Windows XP, or we're happy that
> the tools accessing them can handle SNI). See comment 0 for more info.
>
> :francois -> https://dashboard.heroku.com/apps/srihash

I don't see any problems with relying on SNI. This site is for developers only and in any case, SRI itself is only supported on recent versions of Chrome and Firefox.
Flags: needinfo?(francois)
contribute-json is approved for the switch. I have the cert files and could add them, but I lack sufficient permission in Heroku to do so. Let me know if you need anything from me to proceed. Thanks!
Flags: needinfo?(pmac)
(In reply to Alex Crichton [:acrichto] from comment #13)
> So it sounds like we can take most of these steps independently ourselves,
> right? If so, I'd propose:
...
> 2) Test SNI works at all
...
> Does that sound right?

Yes that sounds great. The testing at #2 can be performed like so:

- Look up new `ZZZ.herokudns.com` SNI CNAME using `heroku domains`
- Grab the IP of the new endpoint (eg `nslookup ZZZ.herokudns.com`)
- `curl -i https://FOO.mozilla.org --resolve 'FOO.mozilla.org:443:NEW_IP_ADDRESS'`
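The same check can be done from Python by connecting to the new endpoint's address while sending the real hostname via SNI. Like `curl --resolve`, this bypasses DNS for `FOO.mozilla.org`, so the new endpoint can be tested before the CNAME is switched. This is a sketch: the function name is hypothetical, and `FOO`/`ZZZ` remain placeholders as in the steps above.

```python
import socket
import ssl


def check_sni_endpoint(hostname: str, target: str,
                       port: int = 443, timeout: float = 10.0) -> str:
    """Handshake with `target` while presenting `hostname` via SNI.

    `target` may be the new *.herokudns.com name or an IP address; resolving
    it here stands in for the nslookup step. The default context verifies the
    certificate chain and that it matches `hostname`; a failure raises
    ssl.SSLError, which is a subclass of OSError.
    """
    ip = socket.gethostbyname(target)  # an IP string is returned unchanged
    context = ssl.create_default_context()
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
            # cert["subject"] is a tuple of RDNs, each a tuple of (key, value) pairs
            subject = dict(rdn[0] for rdn in cert["subject"])
            return "%s via %s, CN=%s" % (tls.version(), ip,
                                        subject.get("commonName"))
```

Usage would be along the lines of `print(check_sni_endpoint("FOO.mozilla.org", "ZZZ.herokudns.com"))`. An exception means either the endpoint is unreachable or the uploaded certificate does not validate for that hostname.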

> * Are there docs for how to upload certificates to the SNI system?

https://devcenter.heroku.com/articles/ssl#add-certificate-and-intermediaries

> * Somewhat unrelated, but does Mozilla have a general solution for SSL certs
> on Heroku through, for example, Let's Encrypt?

I believe the only option is to use the same process as for internally hosted domains.

There are some awfully hacky third-party solutions available to use Let's Encrypt on Heroku (eg https://github.com/substrakt/letsencrypt-heroku), but they involve having to put your Heroku API key on an app somewhere, which seems suboptimal.

Heroku hasn't stated this explicitly, but my guess is that their "we highly recommend that you switch to Heroku SSL as we will be rolling out exciting new features to it over the coming months" (https://blog.heroku.com/ssl-is-now-included-on-all-paid-dynos) might include Let's Encrypt support.

(In reply to Paul [:pmac] McLanahan from comment #15)
> contribute-json is approved for the switch. I have the cert files and could
> add them, but I lack sufficient permission in Heroku to do so. Let me know
> if you need anything from me to proceed. Thanks!

I've added the `manage` permission for you on the contribute-json app, so it should work now (https://devcenter.heroku.com/articles/app-permissions-cheatsheet).
Ah it looks like I'm also unable to modify certs on the crates-io application (no `manage` permission), should I send the certs to you?
I've given you `manage` permissions for the crates-io app now :-)
Thanks!
Attempt failed. I do have the proper permissions now, but apparently the cert has been updated since and the one I have is expired. Not sure who has the current one. Apologies for the churn. I'll see what I can track down.
I've deleted the SSL addon from the crates-io application, appears everything has transitioned smoothly!
We're now down from 17 to 11 apps.

Remaining... (Using `heroku addons --all | grep 'ssl:endpoint' | cut -d' ' -f1`)

Fine to switch: 
bugherder
pulseguardian
srihash
bugzfeed (bug 1336489)
contribute-json (waiting on finding the correct cert)

No reply yet:
mozilla-pontoon (:mattjazz)
platatus (:digitarald)

Might not be able to switch due to compat:
releng-tc-coalesce
releng-tc-coalesce-stage
taskcluster-auth
taskcluster-cors-proxy

I'll file a webops bug shortly to start the switch for bugherder/pulseguardian/srihash/bugzfeed
I think Webops can either dig up the right cert for contributejson or just reissue it, if no one else knows where it's at. Ping if needed.
Depends on: 1342396
Dustin, are you ok with me removing the ssl-endpoint addon from releng-tc-coalesce-stage (not releng-tc-coalesce), since it appears unused?

$ heroku domains --app releng-tc-coalesce-stage
=== releng-tc-coalesce-stage Heroku Domain
releng-tc-coalesce-stage.herokuapp.com

$ heroku certs --app releng-tc-coalesce-stage
releng-tc-coalesce-stage has no SSL certificates.
Use heroku certs:add CRT KEY to add one.
Flags: needinfo?(dustin)
Assignee: nobody → emorley
I'll defer to Rok for that.  I think :dividehex set it up, but Rok probably has better state right now.
Flags: needinfo?(dustin) → needinfo?(rgarbas)
Depends on: 1342983
I'm proxying this to :dividehex since this is not part of mozilla-releng/services[1]. Maybe in the future we can port it to mozilla-releng/services, but there is no plan for it right now.

[1] https://github.com/mozilla-releng/services
Flags: needinfo?(rgarbas) → needinfo?(jwatkins)
To clarify for releng-tc-coalesce-stage specifically - there doesn't appear to be a custom domain configured, so the SSL endpoint isn't being used. As such, I was proposing just removing this unused endpoint (the app would still be accessible via the releng-tc-coalesce-stage.herokuapp.com domain), not the app itself.
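The reasoning above can be written as a small check over the names that `heroku domains` lists. With no custom domain configured, the SSL endpoint has nothing to serve, because `*.herokuapp.com` is already covered by Heroku's own certificate. This sketch uses hypothetical helper names and assumes the domain names have already been extracted from the CLI output:

```python
from typing import Iterable, List

HEROKU_DEFAULT_SUFFIX = ".herokuapp.com"


def custom_domains(domains: Iterable[str]) -> List[str]:
    """Return the domains that are not Heroku's default *.herokuapp.com name."""
    return [d for d in domains
            if not d.strip().lower().rstrip(".").endswith(HEROKU_DEFAULT_SUFFIX)]


def ssl_endpoint_unused(domains: Iterable[str]) -> bool:
    """The SSL addon only serves custom domains, so none means it is idle."""
    return not custom_domains(domains)
```

For the `heroku domains` output shown earlier, `ssl_endpoint_unused(["releng-tc-coalesce-stage.herokuapp.com"])` is `True`.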
Looking at the use of the staging app, it doesn't appear that this app is used at all.  I have emailed a few people that are probably interested in this app to discuss if the app can be removed entirely.  In the meantime though, I do not think removing this SSL endpoint would hurt anything, especially since it's just a staging app with no custom domain setup.  That's just my two cents as someone not as familiar with this app.
Agreed.

$ heroku addons:remove ssl --app releng-tc-coalesce-stage
 !    WARNING: Destructive Action
 !    This command will affect the app releng-tc-coalesce-stage
 !    To proceed, type releng-tc-coalesce-stage or re-run this command with
 !    --confirm releng-tc-coalesce-stage

releng-tc-coalesce-stage
Destroying ssl-trapezoidal-93407 on releng-tc-coalesce-stage... done

Leaving needinfo on :jwatkins for the decision as to whether the app is used at all.
I followed up in email to garndt, coop, and catlee to make sure we get an owner for this service if it's in use/critical.
Flags: needinfo?(jwatkins)
Success! www.contributejson.org has been moved to the SNI SSL Heroku product and the old SSL addon has been removed.

Thanks!
I am unsure from the thread what the action item for me is here. Do I need to do anything to switch or does it just need my thumbs up?
Flags: needinfo?(hkirschner)
(In reply to :Harald Kirschner :digitarald from comment #32)
> I am unsure from the thread what the action item for me is here. Do I need
> to do anything to switch or does it just need my thumbs up?

Hi! The needinfo was added in comment 10, which asked for a thumbs up :-)
As a bonus, apps using the new SNI SSL can switch with zero downtime to Heroku's new free, automatically renewed certificate solution, which makes use of Let's Encrypt:
https://devcenter.heroku.com/articles/automated-certificate-management

This can be enabled with just one CLI command and takes as little as 3 minutes to generate/use the new cert (for an example see bug 1363814 comment 2).
(I encourage switching to the free SSL Heroku option wherever possible. Just mention to Webops so they can refund the previous cert.)
Is anybody looking into adding ``requests[security]`` dependencies (pyOpenSSL, ndg-httpsclient, pyasn1) and enable SNI support in m-c?
Which environments are still using Python < 2.7.9 ?
On Windows we use Python 2.7.5, and I was told not to attempt upgrading it (this was tried before and failed), but rather to wait for the migration to Taskcluster, which should happen by the end of the year if I understand correctly.

There is also Ubuntu 12.04 being used in one of the images[1], but that should not be that hard to fix.

So we either add the above-mentioned packages or postpone SNI adoption until the Windows platform is migrated to Taskcluster.


[1] https://dxr.mozilla.org/mozilla-central/source/taskcluster/docker/recipes/ubuntu1204-test-system-setup.sh
(In reply to Rok Garbas [:garbas] from comment #36)
> Is anybody looking into adding ``requests[security]`` dependencies
> (pyOpenSSL, ndg-httpsclient, pyasn1) and enable SNI support in m-c?

Which Heroku sites is m-c accessing as part of the build, for the record, so that we don't accidentally SNI them someday?

(I wasn't able to immediately determine which m-c resources are fetched from Heroku, which probably isn't surprising given my unfamiliarity.)
:atoll: At this point none.

I was testing (on try) new URLs (with apps deployed on Heroku) for the clobberer/tooltool/treestatus services (http://api.pub.build.mozilla.org) and found out about m-c's old Python version "the hard way" :P

For the purpose of documenting this are the services I'm talking about

  (1) treestatus
      current url -> https://api.pub.build.mozilla.org/treestatus
      new url -> https://treestatus.mozilla-releng.net
      heroku app -> https://dashboard.heroku.com/apps/releng-production-treestatus

  (2) tooltool
      current url -> https://api.pub.build.mozilla.org/tooltool
      new url -> https://tooltool.mozilla-releng.net
      heroku app -> https://dashboard.heroku.com/apps/releng-production-tooltool

  (3) clobberer
      current url -> https://api.pub.build.mozilla.org/clobberer
      new url -> https://clobberer.mozilla-releng.net (not yet running in production, but will be soon)
      heroku app -> https://dashboard.heroku.com/apps/releng-production-clobberer
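For consumers moving from the old base URLs to the new Heroku-hosted ones (and therefore to SNI), the mapping above can be expressed as a lookup. This is a sketch with a hypothetical helper. It ASSUMES that the new services accept the same sub-paths as the old ones, which this bug does not confirm. Also note that clobberer's new URL was not yet live at the time of writing:

```python
from typing import Dict, Optional

# Base URLs as listed above.
OLD_TO_NEW_BASE: Dict[str, str] = {
    "https://api.pub.build.mozilla.org/treestatus": "https://treestatus.mozilla-releng.net",
    "https://api.pub.build.mozilla.org/tooltool": "https://tooltool.mozilla-releng.net",
    "https://api.pub.build.mozilla.org/clobberer": "https://clobberer.mozilla-releng.net",
}


def new_url_for(old_url: str) -> Optional[str]:
    """Rewrite an old api.pub.build URL to its new host, keeping any sub-path.

    ASSUMPTION: sub-paths are unchanged between old and new services.
    Returns None for URLs that are not covered by the mapping.
    """
    for old_base, new_base in OLD_TO_NEW_BASE.items():
        if old_url == old_base or old_url.startswith(old_base + "/"):
            return new_base + old_url[len(old_base):]
    return None
```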
Thank you so much for writing all this up, I really appreciate it.
I'd like to call out here that, per bug 1380177, we explicitly re-enabled the SSL endpoint addons for the releng apps mentioned in comment 40. It hurts, but the alternative is much worse. :-/
Thank you for linking to that. 

To clarify the "re-enabling" reference for anyone else reading the bug: this bug hasn't touched those apps due to comment 7 and others, so either they never used the non-SNI addon or else it was removed elsewhere. I believe it's more that they are newish and are slowly replacing other systems, and it's during the migration to them that the mozilla-central incompatibility has been spotted.
Depends on: 1436962
Bugherder was switched over in bug 1331583 to save having to perform a manual cert renewal.

Updated list of apps still using the legacy SSL addon:

$ heroku addons --all | grep 'ssl:endpoint' | cut -d' ' -f1
mozilla-pontoon
pulseguardian (addon is actually no longer used; bug 1436962 to remove)
releng-production-clobberer
releng-production-tooltool
releng-production-treestatus
releng-staging-tooltool
releng-staging-treestatus
releng-tc-coalesce
srihash
taskcluster-auth


I'm guessing that the work being performed to make automation compatible with GitHub's TLSv1.2+ enforcement will help with future efforts to migrate the releng apps above.
Most users have switched over - let's call this fixed.

The big win with the new SSL implementation is being able to use Heroku ACM (auto-renewing Let's Encrypt certs), so I think the remaining moves will happen naturally once the existing Digicert certs come up for renewal.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(m)
Resolution: --- → FIXED
For posterity,

treestatus and tooltool have been switched over in bug 1487798
coalesce has been switched over in bug 1495489
Taskcluster will migrate away from Heroku in bug 1457610.