Closed Bug 1342619 Opened 7 years ago Closed 7 years ago

antenna initially has no configuration at startup

Categories

(Socorro :: Antenna, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: miles, Assigned: miles)

References

Details

Attachments

(1 file)

Will and I observed during today's load test that the APM-enabled instance of antenna that was being tested was spawning up to 16 gunicorn workers upon initial startup. This was observed using the logs of the docker-antenna service live on the instance during the test.

The logfiles combine many startups of the docker-antenna service from whenever the supervised process exits and is restarted. Upon starting up, gunicorn spawned 16 workers. We configure gunicorn using env variables passed to Docker that are then used in the `docker run` command executed by systemd - but we configure gunicorn with 5 workers (2 * cores in an m4.large + 1).
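For reference, here's a rough sketch of how the worker count could be pinned down in a gunicorn config file so a missing or mangled env variable never falls back to an unexpected default. The GUNICORN_WORKERS name is an assumption, not necessarily what our Docker setup actually passes.

    # gunicorn config files are plain Python modules; "workers" is the
    # setting gunicorn reads. GUNICORN_WORKERS is a hypothetical env var.
    import multiprocessing
    import os

    # Fall back to the 2 * cores + 1 rule if the env variable is missing,
    # so the worker count is always explicit.
    workers = int(os.environ.get(
        "GUNICORN_WORKERS",
        multiprocessing.cpu_count() * 2 + 1,
    ))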

In these logfiles, we can see the different antenna workers starting up, and in many cases failing because of a failure to get the correct s3 endpoint url. This failure doesn't happen to all of the workers, though, and after a few restarts of antenna (over a ~1.5m period) we get a "clean" log in which none of the workers have failed, and there are 5 workers on startup. Somewhere in there, a lot of workers failed, and we went from gunicorn starting 16 workers and having quite a few fail to gunicorn starting 5 workers that all work.

There are some clear issues here:
  1. There are too many gunicorn workers being started some of the time
  2. Sometimes we fail to have the correct s3 endpoint url when we need it

Why 1 happens is a mystery left to next week.

For 2, we theorized that boto3 makes calls against the instance metadata service, and that because we're spawning a bunch of antenna workers that hit it simultaneously, we're getting throttled. It's a bit bizarre that it would throttle as quickly as 16 simultaneous requests, but once any of those first 16 workers fail, the service is killed and restarted by systemd and the antenna workers come back up, eager to hit the instance metadata service again.

2 will probably resolve itself once we spawn fewer workers, if the theory above is correct. Another solution for 2 is to not use a boto query to populate the s3 endpoint, but rather to use a configuration value or interpolate it from the bucket name + region. Are there other calls to s3 on worker startup that hit instance metadata? (If so, that fix would be ineffective.)
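A rough sketch of that alternative, assuming hypothetical CRASHSTORAGE_REGION / CRASHSTORAGE_ENDPOINT_URL settings rather than Antenna's actual configuration names. Note that credential resolution for an instance role can still touch the instance metadata service, which is exactly the caveat in the question above.

    # Hedged sketch: derive the S3 endpoint from configuration instead of
    # letting boto3 discover it. The env variable names are hypothetical.
    import os

    import boto3

    region = os.environ["CRASHSTORAGE_REGION"]  # e.g. "us-west-2"
    endpoint_url = os.environ.get(
        "CRASHSTORAGE_ENDPOINT_URL",
        "https://s3.%s.amazonaws.com" % region,
    )

    # Explicit region + endpoint means no endpoint lookup at startup;
    # credentials may still come from the instance role via metadata.
    client = boto3.client("s3", region_name=region, endpoint_url=endpoint_url)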
Correction: we use c4.larges, which have half the memory of m4.larges. [1]
It might be prudent to switch to m4.larges for the memory bump.

[1] http://www.ec2instances.info/?selected=m4.large,c4.large
Making this block the antenna to -prod bug.
Blocks: 1315258
I'm going to wrap boto3 client creation in a retryable decorator so it'll get retried a few times before ultimately failing. That might alleviate that specific issue assuming our theory is correct.
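Something along these lines, though the actual change may use an existing retry helper rather than this hand-rolled sketch:

    # Sketch of a retrying wrapper around boto3 client creation; the names
    # and backoff values here are illustrative, not the landed code.
    import time

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    def retryable(attempts=5, wait=1.0):
        """Retry the wrapped function a few times before giving up."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                for attempt in range(attempts):
                    try:
                        return fn(*args, **kwargs)
                    except (BotoCoreError, ClientError):
                        if attempt == attempts - 1:
                            raise
                        # Back off a little more on each failure in case the
                        # metadata service is throttling us.
                        time.sleep(wait * (attempt + 1))
            return wrapper
        return decorator

    @retryable(attempts=5, wait=1.0)
    def build_s3_client(region):
        # This is the call that fails when the endpoint lookup goes sideways
        # at startup, so it's the thing worth retrying.
        return boto3.client("s3", region_name=region)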
Miles: I landed a fix for bug #1342583, so startup errors should make their way to Sentry now. We should push that to -stage and maybe it'll help us shed more light on this bug.
There still aren't any startup errors on Sentry. I opened up bug #1347183 to cover that.
There aren't any startup errors in Sentry because the startup errors are from s3 connections failing due to a bad endpoint which only happens at startup for Antenna if it's got no configuration.

Seems like bug #1347183 (no startup errors in sentry), bug #1347185 (s3 connections fail at startup) and this one all have the same cause--no configuration at startup.

I'm assigning this to Miles--it's probably related either to the way the docker container is started or something like that.
Assignee: nobody → miles
Status: NEW → ASSIGNED
Changing the summary to reflect the key issue.
Summary: Resolve antenna's "too many gunicorn workers" issue → antenna initially has no configuration at startup
I spent some time parsing logs. Here's what I did:

1. grabbed a bunch of logs from the -stage grephost

I have logs for 3/19 to now.

2. skimmed the logs for a list of hosts that spit out "Unhandled startup exception" in the logs at some point

ip-172-31-16-38
ip-172-31-16-97
ip-172-31-22-207
ip-172-31-30-253
ip-172-31-49-101
ip-172-31-53-59
ip-172-31-58-198
ip-172-31-60-201
ip-172-31-62-120

3. filtered the logs for those hosts on a host-by-host basis
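
For anyone repeating this later, steps 2 and 3 amounted to roughly the following sketch; the log path and line format here are guesses about the grephost layout, not the exact commands I ran:

    # Hypothetical sketch of steps 2 and 3 above. The glob pattern and the
    # "hostname is the token starting with ip-" assumption are guesses about
    # how the -stage grephost formats its syslog lines.
    import glob

    MARKER = "Unhandled startup exception"
    logfiles = glob.glob("/var/log/antenna/*.log")

    # Step 2: collect hosts that ever logged the startup exception.
    bad_hosts = set()
    for path in logfiles:
        with open(path, errors="replace") as fp:
            for line in fp:
                if MARKER in line:
                    host = next(
                        (tok for tok in line.split() if tok.startswith("ip-")),
                        None,
                    )
                    if host:
                        bad_hosts.add(host)

    # Step 3: split out each bad host's lines into its own file.
    for host in sorted(bad_hosts):
        with open("%s.log" % host, "w") as out:
            for path in logfiles:
                with open(path, errors="replace") as fp:
                    out.writelines(line for line in fp if host in line)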

Now let me tell you some stories.


One day, on the ice planet AWS-EC2, there lived an ephemeral entity known to his friends as ip-172-31-53-59 or 59y for short. 

In its most recent life, 59y came into being on 2017-03-24 17:24:00, had lots and lots of startup errors and then went totally silent at 2017-03-24 17:25:02.

59y never got a chance to receive or save any crashes.


On the third moon of the ice planet AWS-EC2, there lived another ephemeral entity known to his friends as ip-172-31-62-120 or 120y for short.

According to the records of the universe as transcribed on the eternal -stage grephost, 120y came into being on 2017-03-22 17:15:19, worked really hard receiving and saving crashes until 2017-03-22 18:52:58 at which point, 120y seems to have ceased to exist.

Then roughly 6 hours later, on 2017-03-23 00:06:46, 120y comes back into being, has lots and lots of startup errors and then goes totally silent at 2017-03-23 00:08:09. 120y doesn't save or receive any crashes in this particular run.


In the deep crevice of an asteroid elliptically circling (ellipticling?) AWS-EC2, is the third entity we examined. Its name was ip-172-31-60-201, but a devilish space toad once called it 201y, so we'll call it that for brevity.

201y pops into existence at 2017-03-23 01:17:07, has lots and lots of startup errors and snuffs it at 2017-03-23 01:18:48 never to be seen by the eternal grephost again.


That's three datapoints. 201y and 59y are the same: they start up with no configuration, flail around for a few minutes, and then are gone.

I think 120y is running fine, is shut down, and then the ip address is used for a new node which starts up with no configuration, flails around for a few minutes, and then is gone.


Daniel put out the theory that the case where Antenna is started up with no configuration happens when the AMI is being built--for some reason, docker comes up before things are fully built, Antenna flails around for a few minutes and then it's all shut down. Those logs are persisted in systemd of the AMI and then when new nodes come up, they use that AMI and the logs get transferred to the logging infrastructure. Or something like that.

That seems plausible to me from what I'm seeing.


Miles: Can you check to see if we built AMIs at or very close to the following UTC timestamps:

* 2017-03-23 00:06:46
* 2017-03-23 01:17:07
* 2017-03-24 17:24:00
Flags: needinfo?(miles)
Also, in a more concise, less story-driven format:

59y:

             full name: ip-172-31-53-59
            start time: 2017-03-24 17:24:00
              end time: 2017-03-24 17:25:02
    flailed around for: 1m 2s
     astrological sign: aquarius

120y:

             full name: ip-172-31-62-120

       first life (which was fine):
            start time: 2017-03-22 17:15:19
              end time: 2017-03-22 18:52:58
     astrological sign: taurus

      second life (startup errors):
            start time: 2017-03-23 00:06:46
              end time: 2017-03-23 00:08:09
    flailed around for: 1m 23s
     astrological sign: pisces

201y:

             full name: ip-172-31-60-201
            start time: 2017-03-23 01:17:07
              end time: 2017-03-23 01:18:48
    flailed around for: 1m 41s
     astrological sign: gemini
Recent AMIs were created at:

2017-03-22T20:53:41.000Z
2017-03-23T00:08:07.000Z <= matches 2017-03-23 00:06:46
2017-03-23T01:18:49.000Z <= matches 2017-03-23 01:17:07
2017-03-23T19:08:18.000Z
2017-03-23T22:36:10.000Z
2017-03-24T17:25:04.000Z <= matches 2017-03-24 17:24:00
2017-03-25T00:55:43.000Z
2017-03-28T21:01:25.000Z
2017-03-28T21:27:03.000Z
2017-03-28T22:31:15.000Z
2017-03-29T00:39:56.000Z

Given that these were antenna logs we were seeing, this makes me think the disable script we use on the AMIs isn't properly disabling antenna.

This is completely in line with the theory that we had a while ago - that antenna might be running on the AMIs.
I'm going to implement a different way of disabling antenna on the AMIs than we currently use.
Flags: needinfo?(miles)
I can look into this next week when the logs should be devoid of startup errors.
In the last couple days I fixed an issue that would have caused antenna to start up on the AMI. Mentioned this in backchannel. Should have solved this issue.
Sorry about that. I meant to say something like, "Miles thinks he fixed it. To prove it, I want to watch the logs, but there are too many other startup errors at the moment, so it's too noisy. I'll check next week when things are quieter."
I checked the logs for April 3rd, 4th, and 5th and don't see any examples of startup errors. Marking this as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: Infra → Antenna