Closed Bug 1342619 Opened 7 years ago Closed 7 years ago

antenna initially has no configuration at startup

Categories

(Socorro :: Antenna, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: miles, Assigned: miles)

References

Details

Attachments

(1 file)

Will and I observed during today's load test that the APM-enabled instance of antenna that was being tested was spawning up to 16 gunicorn workers upon initial startup. This was observed using the logs of the docker-antenna service live on the instance during the test.

The logfiles combine many startups of the docker-antenna service from whenever the supervised process exits and is restarted. Upon starting up, gunicorn spawned 16 workers. We configure gunicorn using env variables passed to Docker that are then used in the `docker run` command executed by systemd - but we configure gunicorn with 5 workers (2 * cores in an m4.large + 1).
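For reference, here's a rough sketch of how the worker count could be pinned down in a gunicorn config file so a missing or mangled env variable never falls back to an unexpected default. The GUNICORN_WORKERS name is an assumption, not necessarily what our Docker setup actually passes.

    # gunicorn config files are plain Python modules; "workers" is the
    # setting gunicorn reads. GUNICORN_WORKERS is a hypothetical env var.
    import multiprocessing
    import os

    # Fall back to the 2 * cores + 1 rule if the env variable is missing,
    # so the worker count is always explicit.
    workers = int(os.environ.get(
        "GUNICORN_WORKERS",
        multiprocessing.cpu_count() * 2 + 1,
    ))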

In these logfiles, we can see the different antenna workers starting up, and in many cases failing because of a failure to get the correct s3 endpoint url. This failure doesn't happen to all of the workers, though, and after a few restarts of antenna (over a ~1.5m period) we get a "clean" log in which none of the workers have failed, and there are 5 workers on startup. Somewhere in there, a lot of workers failed, and we went from gunicorn starting 16 workers and having quite a few fail to gunicorn starting 5 workers that all work.

There are some clear issues here:
  1. There are too many gunicorn workers being started some of the time
  2. Sometimes we fail to have the correct s3 endpoint url when we need it

Why 1 happens is a mystery left to next week.

For 2, we theorized that boto3 makes calls against the instance metadata service, and that because we're spawning a bunch of antenna workers that hit it simultaneously, we're getting throttled. It's a bit bizarre that it would throttle as quickly as 16 simultaneous requests, but once any of those first 16 workers fail, the service is killed and restarted by systemd and the antenna workers come back up, eager to hit the instance metadata service again.

2 will probably resolve itself once we spawn fewer workers, if the theory above is correct. Another solution for 2 is to not use a boto query to populate the s3 endpoint, but rather to use a configuration value or interpolate it from the bucket name + region. Are there other calls to s3 on worker startup that hit instance metadata? (If so, that fix would be ineffective.)
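A rough sketch of that alternative, assuming hypothetical CRASHSTORAGE_REGION / CRASHSTORAGE_ENDPOINT_URL settings rather than Antenna's actual configuration names. Note that credential resolution for an instance role can still touch the instance metadata service, which is exactly the caveat in the question above.

    # Hedged sketch: derive the S3 endpoint from configuration instead of
    # letting boto3 discover it. The env variable names are hypothetical.
    import os

    import boto3

    region = os.environ["CRASHSTORAGE_REGION"]  # e.g. "us-west-2"
    endpoint_url = os.environ.get(
        "CRASHSTORAGE_ENDPOINT_URL",
        "https://s3.%s.amazonaws.com" % region,
    )

    # Explicit region + endpoint means no endpoint lookup at startup;
    # credentials may still come from the instance role via metadata.
    client = boto3.client("s3", region_name=region, endpoint_url=endpoint_url)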
Correction: we use c4.larges, which have half the memory of m4.larges. [1]
It might be prudent to switch to m4.larges for the memory bump.

[1] http://www.ec2instances.info/?selected=m4.large,c4.large
Making this block the antenna to -prod bug.
Blocks: 1315258
I'm going to wrap boto3 client creation in a retryable decorator so it'll get retried a few times before ultimately failing. That might alleviate that specific issue assuming our theory is correct.
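Something along these lines, though the actual change may use an existing retry helper rather than this hand-rolled sketch:

    # Sketch of a retrying wrapper around boto3 client creation; the names
    # and backoff values here are illustrative, not the landed code.
    import time

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    def retryable(attempts=5, wait=1.0):
        """Retry the wrapped function a few times before giving up."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                for attempt in range(attempts):
                    try:
                        return fn(*args, **kwargs)
                    except (BotoCoreError, ClientError):
                        if attempt == attempts - 1:
                            raise
                        # Back off a little more on each failure in case the
                        # metadata service is throttling us.
                        time.sleep(wait * (attempt + 1))
            return wrapper
        return decorator

    @retryable(attempts=5, wait=1.0)
    def build_s3_client(region):
        # This is the call that fails when the endpoint lookup goes sideways
        # at startup, so it's the thing worth retrying.
        return boto3.client("s3", region_name=region)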
Miles: I landed a fix for bug #1342583, so startup errors should make their way to Sentry now. We should push that to -stage and maybe it'll help us shed more light on this bug.
There still aren't any startup errors on Sentry. I opened up bug #1347183 to cover that.
There aren't any startup errors in Sentry because the startup errors are from s3 connections failing due to a bad endpoint which only happens at startup for Antenna if it's got no configuration.

Seems like bug #1347183 (no startup errors in sentry), bug #1347185 (s3 connections fail at startup) and this one all have the same cause--no configuration at startup.

I'm assigning this to Miles--it's probably related either to the way the docker container is started or something like that.
Assignee: nobody → miles
Status: NEW → ASSIGNED
Changing the summary to reflect the key issue.
Summary: Resolve antenna's "too many gunicorn workers" issue → antenna initially has no configuration at startup
I spent some time parsing logs. Here's what I did:

1. grabbed a bunch of logs from the -stage grephost

I have logs for 3/19 to now.

2. skimmed the logs for a list of hosts that spit out "Unhandled startup exception" in the logs at some point

ip-172-31-16-38
ip-172-31-16-97
ip-172-31-22-207
ip-172-31-30-253
ip-172-31-49-101
ip-172-31-53-59
ip-172-31-58-198
ip-172-31-60-201
ip-172-31-62-120

3. filtered the logs for those hosts on a host-by-host basis
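
For anyone repeating this later, steps 2 and 3 amounted to roughly the following sketch; the log path and line format here are guesses about the grephost layout, not the exact commands I ran:

    # Hypothetical sketch of steps 2 and 3 above. The glob pattern and the
    # "hostname is the token starting with ip-" assumption are guesses about
    # how the -stage grephost formats its syslog lines.
    import glob

    MARKER = "Unhandled startup exception"
    logfiles = glob.glob("/var/log/antenna/*.log")

    # Step 2: collect hosts that ever logged the startup exception.
    bad_hosts = set()
    for path in logfiles:
        with open(path, errors="replace") as fp:
            for line in fp:
                if MARKER in line:
                    host = next(
                        (tok for tok in line.split() if tok.startswith("ip-")),
                        None,
                    )
                    if host:
                        bad_hosts.add(host)

    # Step 3: split out each bad host's lines into its own file.
    for host in sorted(bad_hosts):
        with open("%s.log" % host, "w") as out:
            for path in logfiles:
                with open(path, errors="replace") as fp:
                    out.writelines(line for line in fp if host in line)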

Now let me tell you some stories.


One day, on the ice planet AWS-EC2, there lived an ephemeral entity known to his friends as ip-172-31-53-59 or 59y for short. 

In its most recent life, 59y came into being on 2017-03-24 17:24:00, had lots and lots of startup errors and then went totally silent at 2017-03-24 17:25:02.

59y never got a chance to receive or save any crashes.


On the third moon of the ice planet AWS-EC2, there lived another ephemeral entity known to his friends as ip-172-31-62-120 or 120y for short.

According to the records of the universe as transcribed on the eternal -stage grephost, 120y came into being on 2017-03-22 17:15:19, worked really hard receiving and saving crashes until 2017-03-22 18:52:58 at which point, 120y seems to have ceased to exist.

Then roughly 6 hours later, on 2017-03-23 00:06:46, 120y comes back into being, has lots and lots of startup errors and then goes totally silent at 2017-03-23 00:08:09. 120y doesn't save or receive any crashes in this particular run.


In the deep crevice of an asteroid elliptically circling (ellipticling?) AWS-EC2, is the third entity we examined. Its name was ip-172-31-60-201, but a devilish space toad once called it 201y, so we'll call it that for brevity.

201y pops into existence at 2017-03-23 01:17:07, has lots and lots of startup errors and snuffs it at 2017-03-23 01:18:48 never to be seen by the eternal grephost again.


That's three datapoints. 201y and 59y are the same: they start up with no configuration, flail around for a few minutes, and then are gone.

I think 120y is running fine, is shut down, and then the ip address is used for a new node which starts up with no configuration, flails around for a few minutes, and then is gone.


Daniel put out the theory that the case where Antenna is started up with no configuration happens when the AMI is being built--for some reason, docker comes up before things are fully built, Antenna flails around for a few minutes and then it's all shut down. Those logs are persisted in systemd of the AMI and then when new nodes come up, they use that AMI and the logs get transferred to the logging infrastructure. Or something like that.

That seems plausible to me from what I'm seeing.


Miles: Can you check to see if we built AMIs at or very close to the following UTC timestamps:

* 2017-03-23 00:06:46
* 2017-03-23 01:17:07
* 2017-03-24 17:24:00
Flags: needinfo?(miles)
Also, in a more concise, less story-driven format:

59y:

             full name: ip-172-31-53-59
            start time: 2017-03-24 17:24:00
              end time: 2017-03-24 17:25:02
    flailed around for: 1m 2s
     astrological sign: aquarius

120y:

             full name: ip-172-31-62-120

       first life (which was fine):
            start time: 2017-03-22 17:15:19
              end time: 2017-03-22 18:52:58
     astrological sign: taurus

      second life (startup errors):
            start time: 2017-03-23 00:06:46
              end time: 2017-03-23 00:08:09
    flailed around for: 1m 23s
     astrological sign: pisces

201y:

             full name: ip-172-31-60-201
            start time: 2017-03-23 01:17:07
              end time: 2017-03-23 01:18:48
    flailed around for: 1m 41s
     astrological sign: gemini
Recent AMIs were created at:

2017-03-22T20:53:41.000Z
2017-03-23T00:08:07.000Z <= matches 2017-03-23 00:06:46
2017-03-23T01:18:49.000Z <= matches 2017-03-23 01:17:07
2017-03-23T19:08:18.000Z
2017-03-23T22:36:10.000Z
2017-03-24T17:25:04.000Z <= matches 2017-03-24 17:24:00
2017-03-25T00:55:43.000Z
2017-03-28T21:01:25.000Z
2017-03-28T21:27:03.000Z
2017-03-28T22:31:15.000Z
2017-03-29T00:39:56.000Z

Given that these were antenna logs we were seeing, this makes me think the disable script we use on the AMIs isn't properly disabling antenna.

This is completely in line with the theory that we had a while ago - that antenna might be running on the AMIs.
I'm going to implement a different way of disabling antenna on the AMIs than we currently use.
Flags: needinfo?(miles)
I can look into this next week when the logs should be devoid of startup errors.
In the last couple days I fixed an issue that would have caused antenna to start up on the AMI. Mentioned this in backchannel. Should have solved this issue.
Sorry about that. I meant to say something like, "Miles thinks he fixed it. To prove it, I want to watch the logs, but there are too many other startup errors at the moment, so it's too noisy. I'll check next week when things are quieter."
I checked the logs for April 3rd, 4th, and 5th and don't see any examples of startup errors. Marking this as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: Infra → Antenna