s3 connections fail at startup [antenna]


Status: RESOLVED FIXED

People: (Reporter: willkg, Unassigned)


Attachments: (1 attachment)

For the last few weeks at least, when we start up an Antenna node in -dev or -stage, it takes a while for that node to "figure itself out". During this period, there are tons of nonsensical, interleaved lines in the logs like this:

app.add_route('breakpad', '/submit', BreakpadSubmitterResource(config))
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 70, in create_client
File "/app/antenna/ext/s3/connection.py", line 139, in _build_client
File "/app/antenna/app.py", line 251, in get_app
timeout=(new_config.connect_timeout, new_config.read_timeout))
ValueError: Invalid endpoint: https://s3..amazonaws.com
File "/app/antenna/ext/s3/connection.py", line 99, in __init__
File "/app/antenna/app.py", line 251, in get_app
ret = fun(*args, **kwargs)
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 224, in _get_client_args
File "/usr/local/lib/python3.5/site-packages/boto3/session.py", line 263, in client
self.crashstorage = self.config('crashstorage_class')(config.with_namespace('crashstorage'))
Traceback (most recent call last):
timeout=(new_config.connect_timeout, new_config.read_timeout))
ret = fun(*args, **kwargs)
ValueError: Invalid endpoint: https://s3..amazonaws.com
File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 224, in _get_client_args
ret = fun(*args, **kwargs)
File "/app/antenna/ext/s3/connection.py", line 99, in __init__
timeout=(new_config.connect_timeout, new_config.read_timeout))
self.crashstorage = self.config('crashstorage_class')(config.with_namespace('crashstorage'))
File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 224, in _get_client_args
File "/usr/local/lib/python3.5/site-packages/boto3/session.py", line 263, in client
File "/app/antenna/ext/s3/connection.py", line 99, in __init__
File "/app/antenna/app.py", line 251, in get_app
ValueError: Invalid endpoint: https://s3..amazonaws.com
File "/usr/local/lib/python3.5/site-packages/boto3/session.py", line 263, in client
client_config=config, api_version=api_version)
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/app/antenna/ext/s3/crashstorage.py", line 47, in __init__
File "/app/antenna/ext/s3/connection.py", line 139, in _build_client
self.client = self._build_client()
File "/app/antenna/ext/s3/connection.py", line 139, in _build_client
self.client = self._build_client()
verify, credentials, scoped_config, client_config, endpoint_bridge)
app.add_route('breakpad', '/submit', BreakpadSubmitterResource(config))
aws_session_token=aws_session_token, config=config)
[2017-03-13 23:45:41 +0000] [ANTENNA ip-172-31-57-191 25] [ERROR] antenna.app: Unhandled startup exception


What's going on is that the app creates a BreakpadSubmitterResource, which creates an S3 connection, which then tests itself by HEADing the S3 bucket, and somewhere in there it fails with an "invalid endpoint" error:

ValueError: Invalid endpoint: https://s3..amazonaws.com
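
For the record, here's a minimal sketch of the failure mode as I understand it. This is not the actual Antenna code (the function, bucket name, and timeouts are made up, and the config handling is simplified); the key detail is the double dot in the endpoint, which suggests the region setting is empty when the client gets built.

import boto3
from botocore.client import Config

def build_and_verify_client(region, bucket):
    # Roughly the same shape as _build_client: build an S3 client with our
    # timeouts, then verify the connection by HEADing the bucket.
    session = boto3.session.Session()
    client = session.client(
        "s3",
        region_name=region,
        config=Config(connect_timeout=5, read_timeout=5),
    )
    client.head_bucket(Bucket=bucket)
    return client

# With an empty region string, the botocore version in the traceback ends up
# with an endpoint like "https://s3..amazonaws.com" and rejects it:
#
#     build_and_verify_client("", "some-crash-bucket")
#     -> ValueError: Invalid endpoint: https://s3..amazonaws.com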

After the "period of calamity" has passed, the Antenna processes start connecting and working fine and then everything is super.


This bug covers trying to figure out what's going on. Is it bad? Can we ignore it? Does it have consequences in how we think about our system?
PR 181 landed and has been deployed. I still see startup errors on -stage, but now they get retried. I'm not sure whether that made the situation better, worse, or had no effect at all, because the logs are a mess.

I also landed a patch to log the tracebacks as a single line rather than an interleaved mess--that should help.
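
I'm not pasting that patch here, but the gist is something like this (the function and logger names are illustrative, not the actual Antenna code):

import logging
import traceback

logger = logging.getLogger("antenna.app")

def log_unhandled_exception(msg):
    # Multi-line tracebacks from several worker processes interleave in the
    # combined log stream; collapsing the newlines keeps each traceback
    # together on a single log line.
    tb = traceback.format_exc().replace("\n", " | ")
    logger.error("%s: %s", msg, tb)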

Beyond that, I'm pretty stuck. My thoughts are these:

1. increase the time between retries and maybe add some jitter--the theory here being that we're getting rate-limited because x nodes with y processes are all trying to get credentials at the same time (see the sketch after this list)

2. switch from binding credentials to the EC2 node to providing them in configuration
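
For what it's worth, here's roughly what option 1 looks like in code. This is a sketch, not the actual change in PR 185; the function name, wait values, and attempt count are made up.

import random
import time

def retry_with_jitter(fn, max_attempts=5, base_wait=2.0, max_jitter=2.0):
    # Call fn(), retrying on failure with a growing, jittered wait so that
    # nodes/processes that all start at the same time don't re-hit AWS in
    # lockstep.
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_wait * attempt + random.uniform(0, max_jitter))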


I'm kind of loath to keep spending time looking into this.

I threw together a PR for option 1 (https://github.com/mozilla/antenna/pull/185). If that doesn't pan out and nothing new has been discovered that gives us other options, then I'm going to suggest we switch the way we get credentials.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
First off, the logs are a lot easier to read now. Yay!

Second, I think it looks a lot better. Increasing the retry wait time and adding jitter seems to have reduced the number of total failures. I only see 4 or 5 in the last batch. That means most of the Antenna processes managed to get a connection.

I think this is good enough to put on hold for a while. Since we still see occasional failures, the issue isn't really resolved, but it's sufficiently alleviated.

I'm going to unassign it from me and remove it as a blocker for getting Antenna to -prod.
Assignee: willkg → nobody
No longer blocks: 1315258
Status: ASSIGNED → NEW
Depends on: 1342619
This was definitely caused by running with no configuration, which happened while we were creating the AMI. Antenna shouldn't have been running there at all; that was fixed in bug #1342619.

This is no longer happening, so we're all good here. Marking as FIXED.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: General → Antenna