Closed
Bug 1342619
Opened 7 years ago
Closed 7 years ago
antenna initially has no configuration at startup
Categories
(Socorro :: Antenna, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: miles, Assigned: miles)
References
Details
Attachments
(1 file)
Will and I observed during today's load test that the APM-enabled instance of antenna being tested was spawning up to 16 gunicorn workers on initial startup. We observed this in the logs of the docker-antenna service live on the instance during the test. The logfiles combine many startups of the docker-antenna service, one for each time the supervised process exited and was restarted.

Upon starting up, gunicorn spawned 16 workers. We configure gunicorn using env variables passed to Docker, which are then used in the `docker run` command executed by systemd - and we configure gunicorn with 5 workers (2 * cores in an m4.large + 1). In these logfiles we can see the different antenna workers starting up, and in many cases failing because they couldn't get the correct s3 endpoint url. This failure doesn't happen to all of the workers, though, and after a few restarts of antenna (over a ~1.5m period) we get a "clean" log in which none of the workers have failed and there are 5 workers on startup. Somewhere in there a lot of workers failed, and we went from gunicorn starting 16 workers with quite a few failing to gunicorn starting 5 workers that all work.

There are some clear issues here:

1. Too many gunicorn workers are being started some of the time.
2. Sometimes we fail to have the correct s3 endpoint url when we need it.

Why 1 happens is a mystery left to next week. For 2, we theorized that boto3 makes calls against the instance metadata service, and that because we're spawning a bunch of antenna workers that hit it simultaneously, we're getting throttled. It's a bit bizarre that it would throttle at as few as 16 simultaneous requests, but once any of those first 16 workers fails, the service is killed and restarted by systemd, and the antenna workers come back up, eager to hit the instance metadata service again. If the theory above is correct, 2 will probably resolve itself once we spawn fewer workers.
Another solution for 2 is to not populate the s3 endpoint with a boto query, but rather use a configuration value or interpolate it from the bucket name + region. Are there other calls to s3 on worker startup that hit instance metadata? (If so, that fix would be ineffective.)
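For reference, the worker-count formula mentioned above (2 * cores + 1, which gives 5 on a 2-vCPU instance) can be sketched like this. This is an illustrative sketch, not Antenna's actual configuration code, and the `GUNICORN_WORKERS` variable name is an assumption:

```python
import multiprocessing
import os


def gunicorn_worker_count():
    """Return the worker count from the environment if set,
    else fall back to the 2 * cores + 1 heuristic.

    GUNICORN_WORKERS is a hypothetical variable name, not
    necessarily the one Antenna's Docker setup uses.
    """
    configured = os.environ.get("GUNICORN_WORKERS")
    if configured is not None:
        return int(configured)
    return 2 * multiprocessing.cpu_count() + 1
```

If the env variable were missing or ignored at startup (i.e. no configuration), a fallback based on the machine's core count would explain seeing a different worker count than the configured 5.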
Comment 1•7 years ago
Correction: we use c4.larges, which have half the memory of m4.larges. [1] It might be prudent to switch to m4.larges for the memory bump.

[1] http://www.ec2instances.info/?selected=m4.large,c4.large
Comment 3•7 years ago
I'm going to wrap boto3 client creation in a retryable decorator so it'll get retried a few times before ultimately failing. That might alleviate that specific issue assuming our theory is correct.
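A generic retry decorator along these lines might look like the following. This is a sketch under assumptions, not Antenna's actual code; in practice it would wrap whatever function calls `boto3.client(...)`, and the attempt count and backoff are illustrative:

```python
import functools
import time


def retryable(attempts=5, base_delay=0.5, exceptions=(Exception,)):
    """Retry the wrapped callable with simple exponential backoff
    before ultimately re-raising the last error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts--surface the original error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator


# Hypothetical usage for the boto3 case described above:
#
# @retryable(attempts=5, exceptions=(botocore.exceptions.ClientError,))
# def build_s3_client():
#     return boto3.client("s3")
```

If the throttling theory is right, a transient instance-metadata failure would then just delay worker startup by a second or two instead of killing the whole service.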
Comment 4•7 years ago
Comment 5•7 years ago
Miles: I landed a fix for bug #1342583, so startup errors should make their way to Sentry now. We should push that to -stage and maybe it'll help us shed more light on this bug.
Comment 6•7 years ago
There still aren't any startup errors on Sentry. I opened up bug #1347183 to cover that.
Comment 7•7 years ago
There aren't any startup errors in Sentry because the startup errors are from s3 connections failing due to a bad endpoint which only happens at startup for Antenna if it's got no configuration. Seems like bug #1347183 (no startup errors in sentry), bug #1347185 (s3 connections fail at startup) and this one all have the same cause--no configuration at startup. I'm assigning this to Miles--it's probably related either to the way the docker container is started or something like that.
Assignee: nobody → miles
Status: NEW → ASSIGNED
Comment 8•7 years ago
Changing the summary to reflect the key issue.
Summary: Resolve antenna's "too many gunicorn workers" issue → antenna initially has no configuration at startup
Comment 9•7 years ago
I spent some time parsing logs. Here's what I did:

1. Grabbed a bunch of logs from the -stage grephost. I have logs for 3/19 to now.
2. Skimmed the logs for a list of hosts that spit out "Unhandled startup exception" at some point:
   ip-172-31-16-38
   ip-172-31-16-97
   ip-172-31-22-207
   ip-172-31-30-253
   ip-172-31-49-101
   ip-172-31-53-59
   ip-172-31-58-198
   ip-172-31-60-201
   ip-172-31-62-120
3. Filtered the logs for those hosts on a host-by-host basis.

Now let me tell you some stories.

One day, on the ice planet AWS-EC2, there lived an ephemeral entity known to his friends as ip-172-31-53-59, or 59y for short. In its most recent life, 59y came into being on 2017-03-24 17:24:00, had lots and lots of startup errors, and then went totally silent at 2017-03-24 17:25:02. 59y never got a chance to receive or save any crashes.

On the third moon of the ice planet AWS-EC2, there lived another ephemeral entity known to his friends as ip-172-31-62-120, or 120y for short. According to the records of the universe as transcribed on the eternal -stage grephost, 120y came into being on 2017-03-22 17:15:19 and worked really hard receiving and saving crashes until 2017-03-22 18:52:58, at which point 120y seems to have ceased to exist. Then, roughly 6 hours later, on 2017-03-23 00:06:46, 120y comes back into being, has lots and lots of startup errors, and goes totally silent at 2017-03-23 00:08:09. 120y doesn't receive or save any crashes in this particular run.

In the deep crevice of an asteroid elliptically circling (ellipticling?) AWS-EC2 is the third entity we examined. Its name was ip-172-31-60-201, but a devilish space toad once called it 201y, so we'll call it that for brevity. 201y pops into existence at 2017-03-23 01:17:07, has lots and lots of startup errors, and snuffs it at 2017-03-23 01:18:48, never to be seen by the eternal grephost again.

That's three datapoints. 201y and 59y are the same: they start up with no configuration, flail around for a few minutes, and then are gone.
I think 120y was running fine, was shut down, and then the ip address was reused for a new node which started up with no configuration, flailed around for a few minutes, and then was gone.

Daniel put out the theory that the case where Antenna starts up with no configuration happens when the AMI is being built: for some reason, docker comes up before things are fully built, Antenna flails around for a few minutes, and then it's all shut down. Those logs are persisted in systemd on the AMI, and when new nodes come up using that AMI, the old logs get transferred to the logging infrastructure. Or something like that. That seems plausible to me from what I'm seeing.

Miles: Can you check to see if we built AMIs at or very close to the following UTC timestamps?

* 2017-03-23 00:06:46
* 2017-03-23 01:17:07
* 2017-03-24 17:24:00
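The triage steps above (find the hosts that logged "Unhandled startup exception", then pull each host's lines out of the combined log) can be sketched as follows. This assumes only that each log line contains the `ip-...` hostname somewhere; the exact log format is not specified here:

```python
import re


def hosts_with_startup_errors(lines):
    """Collect the ip-... hostnames from lines that mention an
    unhandled startup exception."""
    hosts = set()
    for line in lines:
        if "Unhandled startup exception" in line:
            m = re.search(r"ip-\d+-\d+-\d+-\d+", line)
            if m:
                hosts.add(m.group(0))
    return hosts


def lines_for_host(lines, host):
    """Filter the combined log down to a single host's lines."""
    return [line for line in lines if host in line]
```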
Flags: needinfo?(miles)
Comment 10•7 years ago
Also, in a more concise, less story-driven format:

59y:
  full name: ip-172-31-53-59
  start time: 2017-03-24 17:24:00
  end time: 2017-03-24 17:25:02
  flailed around for: 1m 2s
  astrological sign: aquarius

120y:
  full name: ip-172-31-62-120
  first life (which was fine):
    start time: 2017-03-22 17:15:19
    end time: 2017-03-22 18:52:58
    astrological sign: taurus
  second life (startup errors):
    start time: 2017-03-23 00:06:46
    end time: 2017-03-23 00:08:09
    flailed around for: 1m 23s
    astrological sign: pisces

201y:
  full name: ip-172-31-60-201
  start time: 2017-03-23 01:17:07
  end time: 2017-03-23 01:18:48
  flailed around for: 1m 41s
  astrological sign: gemini
Comment 11•7 years ago
Recent AMIs were created at:

2017-03-22T20:53:41.000Z
2017-03-23T00:08:07.000Z <= matches 2017-03-23 00:06:46
2017-03-23T01:18:49.000Z <= matches 2017-03-23 01:17:07
2017-03-23T19:08:18.000Z
2017-03-23T22:36:10.000Z
2017-03-24T17:25:04.000Z <= matches 2017-03-24 17:24:00
2017-03-25T00:55:43.000Z
2017-03-28T21:01:25.000Z
2017-03-28T21:27:03.000Z
2017-03-28T22:31:15.000Z
2017-03-29T00:39:56.000Z

Given that these were antenna logs we were seeing, this makes me think the disable script we use on the AMIs isn't properly disabling antenna. This is completely in line with the theory we had a while ago: that antenna might be running on the AMIs. I'm going to implement a different way of disabling antenna on the AMIs than we currently use.
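The matching done above (each error window ends right as an AMI finishes building) can be checked mechanically. A sketch, assuming we pair each error-start timestamp with any AMI created within a few minutes after it; the window size is an assumption:

```python
from datetime import datetime, timedelta


def match_ami_builds(error_starts, ami_times, window=timedelta(minutes=3)):
    """Pair each error-start timestamp with an AMI creation time
    that falls within `window` after it."""
    matches = {}
    for start in error_starts:
        for built in ami_times:
            if timedelta(0) <= built - start <= window:
                matches[start] = built
    return matches
```

Run against the timestamps in this bug, all three error windows pair with an AMI build, and none of the other AMI times match, which supports the running-on-the-AMI theory.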
Flags: needinfo?(miles)
Comment 12•7 years ago
I can look into this next week when the logs should be devoid of startup errors.
Comment 13•7 years ago
In the last couple of days I fixed an issue that would have caused antenna to start up on the AMI. I mentioned this in backchannel. That should have solved this issue.
Comment 14•7 years ago
Sorry about that. I meant to say something like, "Miles thinks he fixed it. To prove it, I want to watch the logs, but there are too many other startup errors at the moment, so it's too noisy. I'll check next week when things are quieter."
Comment 15•7 years ago
I checked the logs for April 3rd, 4th, and 5th and don't see any examples of startup errors. Marking this as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED