Closed Bug 1305129 Opened 4 years ago Closed 4 years ago

requests to https://ssl.google-analytics.com/collect fail with SSLError

Categories: Socorro :: Webapp (task)
Importance: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: peterbe, Assigned: willkg
Attachments: 2 files

I debugged why the google analytics code fails. The exception is:

SSLError("bad handshake: Error([('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')],)",)
As a first step, I'm going to upgrade pyOpenSSL, pyasn1 and ndg-httpsclient since they're all out of date at the moment.
Assignee: nobody → peterbe
Note to self: here's how to debug this on a stage webhead.

1) cd /data/socorro/webapp-django
2) sudo vi crashstats/base/ga.py
3) Uncomment the commented-out logging lines near the top of that file
4) sudo systemctl restart socorro-webapp
5) sudo tail -f /var/log/messages
Blocks: 1306731
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/2c75ece0126ff3d19ea98627e9b38e3e8b488936
Bug 1305129 upgrade pyopenssl pyasn1 and ndg httpsclient (#3485)

* bug 1305129 - upgrade pyOpenSSL, pyasn1 and ndg-httpsclient

* also upgrade to cryptography==1.5
Despite that upgrade, we still get this:

Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: Traceback (most recent call last):
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/raven/transport/threaded_requests.py", line 26, in send_sync
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: super(ThreadedRequestsHTTPTransport, self).send(data, headers)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/raven/transport/requests.py", line 35, in send
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: verify=self.verify_ssl, timeout=self.timeout)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/api.py", line 107, in post
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: return request('post', url, data=data, json=json, **kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/external_trace.py", line 103, in dynamic_wrapper
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: return wrapped(*args, **kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/api.py", line 53, in request
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: return session.request(method=method, url=url, **kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: resp = self.send(prep, **send_kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/raven/breadcrumbs.py", line 279, in send
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: resp = real_send(self, request, *args, **kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/newrelic-2.56.0.42/newrelic/api/external_trace.py", line 103, in dynamic_wrapper
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: return wrapped(*args, **kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: r = adapter.send(request, **kwargs)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/adapters.py", line 447, in send
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: raise SSLError(e, request=request)
Oct 13 20:21:56 stage-webapp-i-e0c8d9f8 bash: SSLError: ("bad handshake: Error([('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')],)",)
Here's what I did to debug this:

1) SSH into a webhead
2) cd /data/socorro/webapp-django
3) sudo vi crashstats/base/ga.py
4) Find the commented-out logging lines near the top and uncomment them
5) sudo systemctl restart socorro-webapp
6) sudo tail -f /var/log/messages
JP,
Can you please take a look at this? It might very well be a CentOS issue. We're eager to get this rolling so we can more confidently deprecate parts of the old public API.

I can help you debug it. Ping me if/when you need a buddy.
Flags: needinfo?(jschneider)
I'm going to grab this from JP and look into it further tomorrow.

Why me? Three reasons:

1. I worked on the GA code in Input that this GA code is based on.
2. I'm vaguely familiar with SSL and Python.
3. I had a dream wherein I was outside sitting in a chair and a cloud began to hover over me and said, "I will rain on you until this problem is solved." I figure that's an omen.
Assignee: peterbe → willkg
Status: NEW → ASSIGNED
Flags: needinfo?(jschneider)
I slimmed the repro down to a single file, runnable in our socorro-virtualenv, that tells me whether it's working or not. That made it much easier to debug the issue.

What's going on is that when we create a ThreadedRequestsHTTPTransport, it (or rather one of its superclasses) has a verify_ssl argument that defaults to True.

When it goes to send stuff, if verify_ssl is True, then it uses the CA bundle that comes with the raven library rather than the one from the system.

If I change the ThreadedRequestsHTTPTransport init args to include verify_ssl=False, then it works fine.
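For what it's worth, verify_ssl=False ultimately means skipping certificate verification at the TLS layer. A stdlib-only sketch of what that toggle amounts to (raven actually just passes verify=False through to requests, but the effect is the same):

```python
import ssl

# Default context: verifies the peer certificate against a CA bundle
# and checks the hostname.
ctx = ssl.create_default_context()

# What verify_ssl=False effectively turns this into:
ctx.check_hostname = False      # must be disabled before changing verify_mode
ctx.verify_mode = ssl.CERT_NONE # accept any certificate, verified or not
```

This is why it "works fine": nothing is checked anymore, including the chain that was failing.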



    BUT WAIT... THERE'S MORE!!!



We're using raven 5.27.1 now and we're not that far out of date. The CHANGES for 5.27.0 include "Update CA bundle":

https://github.com/getsentry/raven-python/blob/f9e9bcd28051e7193969e0c495a1cd413df3e29f/CHANGES#L36

I don't see suggestions that they updated it since then.

However, a close inspection of their code reveals this:

https://github.com/getsentry/raven-python/blob/80490488fb5c62d927ac789313f971e02ee395e2/raven/conf/defaults.py#L45

which suggests they might be pulling the bundle from the certifi package.

We're using certifi 2015.11.20; the most recent is 2016.9.26.

While I haven't tested it, I think updating the certifi package to a more recent one will fix this issue. Even if it doesn't fix the issue, it's a good thing to do.

I'll do a PR for that tomorrow.
I verified that if verify_ssl is True, it's using /home/vagrant/socorro/socorro-virtualenv/lib/python2.7/site-packages/certifi/cacert.pem, which is the CA bundle that comes with certifi.

I tried updating certifi to 2016.9.26, but it still fails in the same way.

If I run the same test script on my local machine (Ubuntu 16.04) using the same versions of requests, certifi and raven, I don't have any issues--works fine.

So at this point, I think we can conclude the problem is in the interaction between the cert-chain validation code and the CA bundle, and it's probably specific to the CentOS version we're using.

I skulked around the requests issue tracker for inspiration and bumped into this:

https://github.com/kennethreitz/requests/issues/3212#issuecomment-221016311

In the vagrant environment with CentOS 7.0, I've got openssl version 1.0.1e. We've got the same version in production.




    BUT WAIT... THERE'S MORE!




Later in that requests issue, someone says they downgraded certifi and then everything worked. I tried downgrading certifi to 2015.4.28 and now it works fine in my vagrant environment.
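One quick way to see what changed between the working and broken certifi releases is to diff or count the roots in each bundle file. A hypothetical helper (the paths you'd feed it are the cacert.pem files from each installed certifi version):

```python
def count_certs(pem_path):
    """Count the certificates in a PEM bundle.

    Each root in a PEM CA bundle is delimited by a BEGIN CERTIFICATE
    marker, so counting markers gives the number of roots.
    """
    with open(pem_path) as f:
        return f.read().count("-----BEGIN CERTIFICATE-----")

# e.g. count_certs(".../certifi-2015.4.28/certifi/cacert.pem")
#  vs. count_certs(".../certifi-2016.9.26/certifi/cacert.pem")
```

The suspicion (per the requests issue linked above) is that newer certifi dropped older 1024-bit roots that our old openssl needs to build the chain.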


I think we have four options:

1. Set verify_ssl=False when setting up the ThreadedRequestsHTTPTransport session, causing it to not verify certificate chains. The version of openssl we're using sucks anyhow, so it's not clear we're getting anything helpful out of verification as it stands.

2. Downgrade certifi to 2015.4.28 because downgrading is the new upgrading! Does this make us "less safe"? I have no idea.

3. Upgrade openssl on all our nodes.

4. Give up. Maybe everything will be different next year.
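If we went with option 2, the pin would presumably look something like this in requirements.txt (a sketch; the version is the one from the requests issue above):

```
# Pinned deliberately -- newer certifi appears to drop the 1024-bit roots
# that our old openssl (1.0.1e) needs to build the chain.
# See bug 1305129. Do NOT blindly upgrade this.
certifi==2015.4.28
```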


I'm inclined to go with option 3 because that's the "most correct". However, I have no idea how hard it is to update openssl on CentOS. At this stage in my life, I've accrued enough lousy experiences with CentOS that I bet it's going to be torturously hard, we're all going to want to cry inconsolably at the all hands in December, and then we won't get anything useful done.

Given that, I think we should go with option 1 or 2. The current situation is that we've got a broken openssl, so it's not clear it's validating certificate chains correctly, and options 1 and 2 probably don't make that meaningfully worse.

Lonnen, JP: Thoughts?
Flags: needinfo?(jschneider)
Flags: needinfo?(chris.lonnen)
Are you saying that setting certifi==2015.4.28 solves the problem?
That's 1.5 years old, so that just feels gross, but perhaps we can simply leave it like that for 6-12 months (with a big fat comment in requirements.txt telling people NOT to keep this one up-to-date).

Is this problem "unique" to usage within a ThreadedRequestsHTTPTransport instance from raven? I see there was a version 5.31.0 that came out about a month ago. Does that have a relevant change?

Dummy question... Why isn't it a huge deal that our openssl in our CentOS is "busted"? Won't that be a potential problem for a bunch of other forms of network communication?
Sorry, I spoke before I tested things. The problem has nothing to do with ThreadedRequestsHTTPTransport specifically; it happens with a plain requests.post too:

[centos@prod-webapp-i-cc0967d4 ~]$ . /data/socorro/socorro-virtualenv/bin/activate
(socorro-virtualenv)[centos@prod-webapp-i-cc0967d4 ~]$ python
Python 2.7.11 (default, Sep 22 2016, 20:18:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.post('https://ssl.google-analytics.com/collect')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/api.py", line 107, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')],)",)
>>>

And I can see that the advice in that requests issue about accepting the weaker 1024-bit cert fixes the problem too:

>>> import certifi; print certifi.old_where()
/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/certifi/weak.pem
>>>
(socorro-virtualenv)[centos@prod-webapp-i-cc0967d4 ~]$ REQUESTS_CA_BUNDLE=/data/socorro/socorro-virtualenv/lib/python2.7/site-packages/certifi/weak.pem python
Python 2.7.11 (default, Sep 22 2016, 20:18:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.post('https://ssl.google-analytics.com/collect')
<Response [200]>
>>>
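The same override can be set process-wide from Python instead of on the shell command line (a sketch; requests consults REQUESTS_CA_BUNDLE when deciding which CA file to verify against, and the path here is the one from the session above):

```python
import os

# Point requests at the alternate CA bundle for the whole process.
# Must be set before the request is made.
os.environ["REQUESTS_CA_BUNDLE"] = (
    "/data/socorro/socorro-virtualenv/lib/python2.7"
    "/site-packages/certifi/weak.pem"
)
```

As noted below, baking in an absolute virtualenv path like this is fragile.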
We talked about this on the call and the consensus was:

1. We're using up-to-date packages for openssl available via the centos official repositories. We could roll our own to get a more recent version, but quitting and starting DQ franchises would be more fun.

2. JP pointed out that New Relic has most of this data. Sending data to New Relic wasn't working when we started on the GA route, but it is now. Previously, when New Relic was brought up, I thought we preferred GA data because it does a breakdown on query-string params, which was important. Seems like that might not be the case anymore.

3. Peter pointed out we can set an environment variable with the absolute path to the CA bundle to use. Will isn't excited about that: it's pretty fragile because those paths are crazy. We could put the CA bundle in the Socorro repo and it'd be less fragile that way. If we did this, it would affect all certificate verification for the webapp. It's an option.

4. Will thinks if we really want to go with GA, then we should use the verify_ssl=False option. We could also subclass ThreadedRequestsHTTPTransport and override the .send() method, which is the one that sets things up such that certificate verification fails.


We should figure out if New Relic is good enough and if it is, then WONTFIX this bug and remove all the GA server-side ping code. If it's not good enough, then I'll probably go with some variant of option 4.
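For the record, the subclass variant of option 4 might look roughly like this. The base class below is a stand-in so the sketch is self-contained; the real ThreadedRequestsHTTPTransport (from raven.transport.threaded_requests, per the traceback above) takes verify_ssl in its constructor and passes it through to requests:

```python
class FakeThreadedTransport(object):
    """Stand-in for raven's ThreadedRequestsHTTPTransport; the real class
    accepts verify_ssl and forwards it to requests as verify=."""
    def __init__(self, verify_ssl=True, **kwargs):
        self.verify_ssl = verify_ssl


class NoVerifyTransport(FakeThreadedTransport):
    """Force certificate verification off for this one transport only."""
    def __init__(self, **kwargs):
        kwargs["verify_ssl"] = False
        super(NoVerifyTransport, self).__init__(**kwargs)
```

The upside over a global CA-bundle override is that only this transport's traffic skips verification.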

Adrian, Peter: Can you two check out the New Relic data and determine whether we can use that for our usage needs?
Flags: needinfo?(peterbe)
Flags: needinfo?(jschneider)
Flags: needinfo?(chris.lonnen)
Flags: needinfo?(adrian)
JP, can you explain how we can get a breakdown of URLs (starting with /api/*) in New Relic?

The advantage of GA is its ability to break down URLs by /api/*, /api/SomeModel/*, /api/SomeModel/?key=value, etc.

Will, you mentioned that it might also be an option to pin a specific version of certifi in requirements.txt. What's your stand on that as an option?

Also, I'm quite intrigued by the idea of setting `verify_ssl=False` in this specific instance. We know, with certainty, that the URL we're posting to is ssl.google-analytics.com, so it's unlikely to leak to other untrusted domains.
Flags: needinfo?(peterbe) → needinfo?(jschneider)
Adrian, Peter and I spent a bunch of time poking around the New Relic data and don't see what we need. We need to be able to look at usage for URLs that start with /api. Further, the New Relic data is sampled, so it's not clear how many uses of an API endpoint would need to happen before it showed up -- and we don't think the data we need exists there anyhow.

Given that, I'm going to add verify_ssl=False to the ThreadedRequestsHTTPTransport instance creation, nix the uid from the params, and then that'll fix this.
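The resulting GA hit ends up along these lines (a sketch using standard Measurement Protocol field names; the actual params built in crashstats/base/ga.py may differ, and note the cid here is a throwaway per-hit value rather than a stable user identifier):

```python
from urllib.parse import urlencode

def build_ga_params(tid, page_path, cid):
    # Standard Measurement Protocol fields: v=protocol version,
    # tid=property id, cid=client id (throwaway, not a stable uid),
    # t=hit type, dp=document path.
    params = {"v": "1", "tid": tid, "cid": cid, "t": "pageview", "dp": page_path}
    return urlencode(sorted(params.items()))
```

The point of dropping the stable uid is that, with verify_ssl=False, the payload should carry no PII even if the connection were intercepted.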

Clearing the needinfos.
Flags: needinfo?(jschneider)
Flags: needinfo?(adrian)
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/f6150e77b92a8af489125dd096739f63c6dab9fc
Fixes bug 1305129: GA ssl handshake fix (#3560)

* Fixes bug 1305129: GA ssl handshake fix

Certificate verification is failing. We debated several options (details
in the bug) and then decided to set verify_ssl=False for this specific
call. Because we're doing that, we nixed all possible PII, too.

* Fix tests

This fixes the tests and also removes the client_id argument.

* Rewrite uuid mocking in tests

This cleans up the uuid mocking by converting the repeated stuff into a
decorator.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
I can confirm that we are successfully sending data to google-analytics.com. I know this because I enabled extra logging by editing the ga.py inside the virtualenv on the stage webhead.
However, at the time of writing, we don't see this show up in Google Analytics yet. Let's wait at least a day.
Duplicate of this bug: 1270583