Closed
Bug 1348881
Opened 8 years ago
Closed 8 years ago
is antenna losing data? [antenna]
Categories
(Socorro :: Antenna, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
Do all crashes submitted to Antenna make it to s3? Does Antenna lose data under 1x load? Does Antenna lose data under 3x load?
This bug covers figuring out how to figure that out, then answering the questions, then figuring out in which circumstances Antenna does lose data.
Comment 1•8 years ago
I thought I had a bug for this, but didn't. I've been working on this for a week or so now. Will write up summary of analysis so far.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Comment 2•8 years ago
First, there was statsd and Datadog. Antenna spits out counts of incoming crashes and crashes it just saved. We threw those into a whoosit in Datadog and ran a bunch of load tests. The problem is that the numbers don't equal one another. Antenna uses UDP to send data to the dd-agent on the node and then that uses TCP to send data to Datadog.
After a bunch of debugging, poking around, and reading, the running theory is that under high load the network stack is saturated and drops some of the data, probably in the UDP segment between Antenna and the dd-agent.
I attempted to alleviate this by batching the counter data in Antenna and then sending it out during the heartbeats every 10 seconds. That didn't help much.
Anyhow, the numbers don't match.
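A minimal sketch of the batching idea described above, not Antenna's actual code: increments are accumulated in memory and sent as a single statsd-format UDP datagram from the heartbeat. The class name, the metric names, and the assumption that the agent listens on 127.0.0.1:8125 are all hypothetical.

```python
import socket
from collections import Counter


class BatchedCounters:
    """Accumulate counter increments and flush them as one UDP datagram."""

    def __init__(self, host="127.0.0.1", port=8125):
        # Assumed address of a statsd-style agent on the node.
        self.addr = (host, port)
        self.counts = Counter()
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, value=1):
        self.counts[name] += value

    def payload(self):
        # statsd counter lines have the form "name:value|c", one per line.
        return "\n".join(
            f"{name}:{value}|c" for name, value in sorted(self.counts.items())
        )

    def flush(self):
        """Intended to be called from the heartbeat (every 10 seconds here)."""
        if not self.counts:
            return
        data = self.payload().encode("utf-8")
        self.counts.clear()
        # UDP is fire-and-forget: a dropped datagram is never reported back.
        self.sock.sendto(data, self.addr)


counters = BatchedCounters()
counters.incr("incoming_crash")  # hypothetical metric name
counters.incr("incoming_crash")
counters.incr("save_crash")      # hypothetical metric name
batched = counters.payload()
```

Batching reduces the number of datagrams, but each datagram now carries more counts, so a single drop loses more data. That may be part of why it didn't help much.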
The next thing I looked at was the logging infrastructure. Antenna uses the Python logging module to log to stdout. That goes through a series of stages and ends up in a /var/log/grep/antenna.app.docker.antenna_YYYY-MM-DD.log file.
I wrote a log parser (https://github.com/willkg/antenna-log-parser) to parse a time span in one or more log files, track hosts, track crashes received and saved and spit it all out telling us whether the number of crashes received == saved and if not, which hosts/processes were having problems. The parser works great.
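The core of that comparison can be sketched as a set difference over crash ids. This is an illustration only, not the code in the linked repository: the CrashEvent field names are taken from the parser output pasted later in this bug, the "save" action value is an assumption (only "receive" appears in that output), and the sample events are made up.

```python
from collections import namedtuple

# Field names match the CrashEvent records shown in the parser output.
CrashEvent = namedtuple("CrashEvent", ["timestamp", "host", "crashid", "action"])


def reconcile(events):
    """Return (received_not_saved, saved_not_received) as sorted crash id lists."""
    received = set()
    saved = set()
    for event in events:
        if event.action == "receive":
            received.add(event.crashid)
        elif event.action == "save":  # assumed action name
            saved.add(event.crashid)
    return sorted(received - saved), sorted(saved - received)


# Made-up events for illustration.
events = [
    CrashEvent("2017-03-22 22:41:53 +0000", "host-a 12", "crash-1", "receive"),
    CrashEvent("2017-03-22 22:41:54 +0000", "host-a 12", "crash-1", "save"),
    CrashEvent("2017-03-22 22:41:55 +0000", "host-a 12", "crash-2", "receive"),
]
received_not_saved, saved_not_received = reconcile(events)
```

Keying on crash id rather than comparing totals matters: equal totals could hide one lost crash offset by one orphaned save.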
Of the 6 load tests I've analyzed, 2 of them have cases of crashes that were received but not saved AND crashes that were saved but not received.
One of the problem load tests was a single-node load test. That load test is designed to kill the node, so we can probably ignore this one.
The other was a cluster load test. That load test had 2160 crashes received but not saved and 2180 crashes saved but not received. How can a crash be saved without being received first? I spot-checked a bunch of the crash ids for crashes saved but not received, and none of them have receive lines in the logs.
One theory here is that we're scaling down fast and the log data doesn't have enough time to make it to the grephost.
Miles wants to shut off scaledown, run load tests, wait some period of time, then check them afterwards. This also gives us a chance to pull the logs directly from the nodes and compare them with what I'm seeing from the grephost.
Comment 3•8 years ago
Here's the output from the load test that ran last night:
[wkahngreene@ip-172-31-29-246 ~]$ python antenna-log-parser/log_parser.py "2017-03-21 03:50" "2017-03-21 05:44" /var/log/grep/antenna.app.docker.antenna_current.log
lines: 967956
From 2017-03-21 03:50 to 2017-03-21 05:44
total crashes in: 483978
total crashes out: 483978
Hosts (35):
ip-172-31-11-57 10 2017-03-21 05:07:05 +0000 2017-03-21 05:35:45 +0000 6258 6258 True
ip-172-31-11-57 11 2017-03-21 05:07:05 +0000 2017-03-21 05:35:47 +0000 5265 5265 True
ip-172-31-11-57 12 2017-03-21 05:07:06 +0000 2017-03-21 05:35:48 +0000 4847 4847 True
ip-172-31-11-57 13 2017-03-21 05:07:06 +0000 2017-03-21 05:35:46 +0000 5736 5736 True
ip-172-31-11-57 9 2017-03-21 05:07:05 +0000 2017-03-21 05:35:48 +0000 9755 9755 True
ip-172-31-22-207 10 2017-03-21 04:32:07 +0000 2017-03-21 05:35:46 +0000 11659 11659 True
ip-172-31-22-207 11 2017-03-21 04:32:07 +0000 2017-03-21 05:35:47 +0000 10703 10703 True
ip-172-31-22-207 12 2017-03-21 04:32:07 +0000 2017-03-21 05:35:48 +0000 13895 13895 True
ip-172-31-22-207 13 2017-03-21 04:32:07 +0000 2017-03-21 05:35:47 +0000 12769 12769 True
ip-172-31-22-207 9 2017-03-21 04:32:07 +0000 2017-03-21 05:35:46 +0000 22398 22398 True
ip-172-31-36-31 10 2017-03-21 04:22:46 +0000 2017-03-21 05:35:46 +0000 14240 14240 True
ip-172-31-36-31 11 2017-03-21 04:22:46 +0000 2017-03-21 05:35:47 +0000 15573 15573 True
ip-172-31-36-31 12 2017-03-21 04:22:46 +0000 2017-03-21 05:35:46 +0000 12802 12802 True
ip-172-31-36-31 13 2017-03-21 04:22:46 +0000 2017-03-21 05:35:48 +0000 17143 17143 True
ip-172-31-36-31 9 2017-03-21 04:22:46 +0000 2017-03-21 05:35:47 +0000 27115 27115 True
ip-172-31-48-34 10 2017-03-21 05:06:22 +0000 2017-03-21 05:35:46 +0000 5540 5540 True
ip-172-31-48-34 11 2017-03-21 05:06:22 +0000 2017-03-21 05:35:47 +0000 6490 6490 True
ip-172-31-48-34 12 2017-03-21 05:06:22 +0000 2017-03-21 05:35:48 +0000 10245 10245 True
ip-172-31-48-34 13 2017-03-21 05:06:23 +0000 2017-03-21 05:35:43 +0000 4988 4988 True
ip-172-31-48-34 9 2017-03-21 05:06:23 +0000 2017-03-21 05:35:46 +0000 5973 5973 True
ip-172-31-54-80 10 2017-03-21 04:34:31 +0000 2017-03-21 05:35:46 +0000 10205 10205 True
ip-172-31-54-80 11 2017-03-21 04:34:31 +0000 2017-03-21 05:35:46 +0000 11306 11306 True
ip-172-31-54-80 12 2017-03-21 04:34:31 +0000 2017-03-21 05:35:45 +0000 12228 12228 True
ip-172-31-54-80 13 2017-03-21 04:34:29 +0000 2017-03-21 05:35:47 +0000 13444 13444 True
ip-172-31-54-80 9 2017-03-21 04:34:29 +0000 2017-03-21 05:35:48 +0000 22262 22262 True
ip-172-31-58-252 11 2017-03-21 03:56:10 +0000 2017-03-21 05:35:46 +0000 22047 22047 True
ip-172-31-58-252 12 2017-03-21 03:56:10 +0000 2017-03-21 05:35:48 +0000 23885 23885 True
ip-172-31-58-252 13 2017-03-21 03:56:10 +0000 2017-03-21 05:35:46 +0000 19271 19271 True
ip-172-31-58-252 14 2017-03-21 03:56:10 +0000 2017-03-21 05:35:45 +0000 20660 20660 True
ip-172-31-58-252 15 2017-03-21 03:56:10 +0000 2017-03-21 05:35:47 +0000 34072 34072 True
ip-172-31-7-181 10 2017-03-21 04:32:09 +0000 2017-03-21 05:35:44 +0000 10384 10384 True
ip-172-31-7-181 11 2017-03-21 04:32:10 +0000 2017-03-21 05:35:48 +0000 12693 12693 True
ip-172-31-7-181 12 2017-03-21 04:32:09 +0000 2017-03-21 05:35:47 +0000 22381 22381 True
ip-172-31-7-181 13 2017-03-21 04:32:09 +0000 2017-03-21 05:35:46 +0000 14022 14022 True
ip-172-31-7-181 9 2017-03-21 04:32:09 +0000 2017-03-21 05:35:48 +0000 11724 11724 True
Received but not saved (0):
Saved but not received (0):
Crashes in == crashes out, so that's good. Yay!
Comment 4•8 years ago
Miles pulled the logs for all the nodes involved in that load test. I ran them through my log parser of AWESOME and got the same results as in comment #3.
For the record, that was this load test:
https://app.datadoghq.com/dash/207347/antenna?live=false&page=0&is_auto=false&from_ts=1490066491546&to_ts=1490077141248&tile_size=m&fullscreen=false
We're going to run a second load test with downscaling shut off, aggregate the logs, and verify the data.
Comment 5•8 years ago
On this load test:
https://app.datadoghq.com/dash/207347/antenna?live=false&page=0&is_auto=false&from_ts=1490132372023&to_ts=1490141475801&tile_size=m&fullscreen=false
My log parser got this on the grephost:
[wkahngreene@ip-172-31-29-246 ~]$ python antenna-log-parser/log_parser.py "2017-03-21 22:00" "2017-03-21 23:59" antenna.app.docker.antenna_2017-03-20 antenna.app.docker.antenna_2017-03-21.log
lines: 925344
From 2017-03-21 22:00 to 2017-03-21 23:59
total crashes in: 462672
total crashes out: 462672
Hosts (45):
ip-172-31-13-219 10 2017-03-21 22:53:52 +0000 2017-03-21 23:36:19 +0000 8261 8261 True
ip-172-31-13-219 11 2017-03-21 22:53:51 +0000 2017-03-21 23:36:18 +0000 6826 6826 True
ip-172-31-13-219 12 2017-03-21 22:53:52 +0000 2017-03-21 23:36:19 +0000 9018 9018 True
ip-172-31-13-219 13 2017-03-21 22:53:51 +0000 2017-03-21 23:36:19 +0000 7507 7507 True
ip-172-31-13-219 9 2017-03-21 22:53:52 +0000 2017-03-21 23:36:19 +0000 14770 14770 True
ip-172-31-21-25 10 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 17058 17058 True
ip-172-31-21-25 11 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 13197 13197 True
ip-172-31-21-25 12 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 27669 27669 True
ip-172-31-21-25 13 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 14534 14534 True
ip-172-31-21-25 9 2017-03-21 22:23:17 +0000 2017-03-21 23:36:18 +0000 15773 15773 True
ip-172-31-29-5 10 2017-03-21 23:16:20 +0000 2017-03-21 23:36:19 +0000 3071 3071 True
ip-172-31-29-5 11 2017-03-21 23:16:21 +0000 2017-03-21 23:36:19 +0000 3468 3468 True
ip-172-31-29-5 12 2017-03-21 23:16:20 +0000 2017-03-21 23:36:19 +0000 2741 2741 True
ip-172-31-29-5 13 2017-03-21 23:16:20 +0000 2017-03-21 23:36:20 +0000 6675 6675 True
ip-172-31-29-5 9 2017-03-21 23:16:21 +0000 2017-03-21 23:36:19 +0000 3822 3822 True
ip-172-31-32-201 10 2017-03-21 22:33:48 +0000 2017-03-21 23:36:19 +0000 14158 14158 True
ip-172-31-32-201 11 2017-03-21 22:33:49 +0000 2017-03-21 23:36:19 +0000 12876 12876 True
ip-172-31-32-201 12 2017-03-21 22:33:49 +0000 2017-03-21 23:36:18 +0000 11574 11574 True
ip-172-31-32-201 13 2017-03-21 22:33:48 +0000 2017-03-21 23:36:19 +0000 22461 22461 True
ip-172-31-32-201 9 2017-03-21 22:33:49 +0000 2017-03-21 23:36:19 +0000 10244 10244 True
ip-172-31-35-84 10 2017-03-21 23:16:23 +0000 2017-03-21 23:36:18 +0000 2743 2743 True
ip-172-31-35-84 11 2017-03-21 23:16:23 +0000 2017-03-21 23:36:19 +0000 3838 3838 True
ip-172-31-35-84 12 2017-03-21 23:16:23 +0000 2017-03-21 23:36:18 +0000 3433 3433 True
ip-172-31-35-84 13 2017-03-21 23:16:23 +0000 2017-03-21 23:36:19 +0000 6679 6679 True
ip-172-31-35-84 9 2017-03-21 23:16:23 +0000 2017-03-21 23:36:19 +0000 3105 3105 True
ip-172-31-48-65 11 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 13201 13201 True
ip-172-31-48-65 12 2017-03-21 22:23:17 +0000 2017-03-21 23:36:18 +0000 14417 14417 True
ip-172-31-48-65 13 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 26677 26677 True
ip-172-31-48-65 14 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 17193 17193 True
ip-172-31-48-65 15 2017-03-21 22:23:17 +0000 2017-03-21 23:36:19 +0000 15836 15836 True
ip-172-31-5-63 10 2017-03-21 23:15:50 +0000 2017-03-21 23:36:19 +0000 3594 3594 True
ip-172-31-5-63 11 2017-03-21 23:15:51 +0000 2017-03-21 23:36:19 +0000 3999 3999 True
ip-172-31-5-63 12 2017-03-21 23:15:52 +0000 2017-03-21 23:36:19 +0000 6833 6833 True
ip-172-31-5-63 13 2017-03-21 23:15:50 +0000 2017-03-21 23:36:19 +0000 3223 3223 True
ip-172-31-5-63 9 2017-03-21 23:15:52 +0000 2017-03-21 23:36:18 +0000 2794 2794 True
ip-172-31-54-162 10 2017-03-21 22:56:53 +0000 2017-03-21 23:36:19 +0000 8308 8308 True
ip-172-31-54-162 11 2017-03-21 22:56:55 +0000 2017-03-21 23:36:19 +0000 6196 6196 True
ip-172-31-54-162 12 2017-03-21 22:56:53 +0000 2017-03-21 23:36:19 +0000 13713 13713 True
ip-172-31-54-162 13 2017-03-21 22:56:56 +0000 2017-03-21 23:36:18 +0000 7558 7558 True
ip-172-31-54-162 9 2017-03-21 22:56:54 +0000 2017-03-21 23:36:19 +0000 6995 6995 True
ip-172-31-55-75 10 2017-03-21 22:37:05 +0000 2017-03-21 23:36:19 +0000 9745 9745 True
ip-172-31-55-75 11 2017-03-21 22:37:05 +0000 2017-03-21 23:36:19 +0000 21296 21296 True
ip-172-31-55-75 12 2017-03-21 22:37:05 +0000 2017-03-21 23:36:19 +0000 13024 13024 True
ip-172-31-55-75 13 2017-03-21 22:37:05 +0000 2017-03-21 23:36:19 +0000 10669 10669 True
ip-172-31-55-75 9 2017-03-21 22:37:05 +0000 2017-03-21 23:36:19 +0000 11900 11900 True
Received but not saved (0):
Saved but not received (0):
Comment 6•8 years ago
Miles pulled the logs from all the nodes and I ran my log parser and got the same results as comment #5.
So that's 2 for 2!
Comment 7•8 years ago
On this load test:
https://app.datadoghq.com/dash/207347/antenna?live=false&page=0&is_auto=false&from_ts=1490201552660&to_ts=1490207450075&tile_size=m&fullscreen=false
my log parser got this on the grephost:
[wkahngreene@ip-172-31-29-246 ~]$ python antenna-log-parser/log_parser.py "2017-03-22 16:50" "2017-03-22 18:30" /var/log/grep/antenna.app.docker.antenna_current.log
lines: 941720
From 2017-03-22 16:50 to 2017-03-22 18:30
total crashes in: 470860
total crashes out: 470860
Hosts (45):
ip-172-31-0-70 10 2017-03-22 17:47:38 +0000 2017-03-22 18:18:45 +0000 5937 5937 True
ip-172-31-0-70 11 2017-03-22 17:47:40 +0000 2017-03-22 18:18:35 +0000 7527 7527 True
ip-172-31-0-70 12 2017-03-22 17:47:38 +0000 2017-03-22 18:18:36 +0000 11657 11657 True
ip-172-31-0-70 13 2017-03-22 17:47:40 +0000 2017-03-22 18:18:35 +0000 6394 6394 True
ip-172-31-0-70 9 2017-03-22 17:47:40 +0000 2017-03-22 18:18:35 +0000 6964 6964 True
ip-172-31-1-181 10 2017-03-22 17:37:06 +0000 2017-03-22 18:18:34 +0000 8202 8202 True
ip-172-31-1-181 11 2017-03-22 17:37:05 +0000 2017-03-22 18:18:34 +0000 7508 7508 True
ip-172-31-1-181 12 2017-03-22 17:37:05 +0000 2017-03-22 18:18:33 +0000 15128 15128 True
ip-172-31-1-181 13 2017-03-22 17:37:05 +0000 2017-03-22 18:18:33 +0000 9644 9644 True
ip-172-31-1-181 9 2017-03-22 17:37:05 +0000 2017-03-22 18:18:35 +0000 8864 8864 True
ip-172-31-14-95 11 2017-03-22 17:05:34 +0000 2017-03-22 18:18:35 +0000 15632 15632 True
ip-172-31-14-95 12 2017-03-22 17:05:34 +0000 2017-03-22 18:18:35 +0000 16754 16754 True
ip-172-31-14-95 13 2017-03-22 17:05:33 +0000 2017-03-22 18:18:34 +0000 26540 26540 True
ip-172-31-14-95 14 2017-03-22 17:05:34 +0000 2017-03-22 18:18:35 +0000 18137 18137 True
ip-172-31-14-95 15 2017-03-22 17:05:35 +0000 2017-03-22 18:18:34 +0000 14060 14060 True
ip-172-31-24-80 10 2017-03-22 17:05:35 +0000 2017-03-22 18:18:35 +0000 16761 16761 True
ip-172-31-24-80 11 2017-03-22 17:05:34 +0000 2017-03-22 18:18:35 +0000 18177 18177 True
ip-172-31-24-80 12 2017-03-22 17:05:34 +0000 2017-03-22 18:18:35 +0000 14045 14045 True
ip-172-31-24-80 13 2017-03-22 17:05:34 +0000 2017-03-22 18:18:35 +0000 26687 26687 True
ip-172-31-24-80 9 2017-03-22 17:05:33 +0000 2017-03-22 18:18:35 +0000 15536 15536 True
ip-172-31-29-185 10 2017-03-22 18:18:05 +0000 2017-03-22 18:18:35 +0000 63 63 True
ip-172-31-29-185 11 2017-03-22 18:18:07 +0000 2017-03-22 18:18:35 +0000 59 59 True
ip-172-31-29-185 12 2017-03-22 18:18:06 +0000 2017-03-22 18:18:35 +0000 76 76 True
ip-172-31-29-185 13 2017-03-22 18:18:06 +0000 2017-03-22 18:18:35 +0000 129 129 True
ip-172-31-29-185 9 2017-03-22 18:18:07 +0000 2017-03-22 18:18:35 +0000 62 62 True
ip-172-31-33-133 10 2017-03-22 17:36:23 +0000 2017-03-22 18:18:35 +0000 8961 8961 True
ip-172-31-33-133 11 2017-03-22 17:36:22 +0000 2017-03-22 18:18:35 +0000 15391 15391 True
ip-172-31-33-133 12 2017-03-22 17:36:22 +0000 2017-03-22 18:18:34 +0000 9676 9676 True
ip-172-31-33-133 13 2017-03-22 17:36:22 +0000 2017-03-22 18:18:35 +0000 7548 7548 True
ip-172-31-33-133 14 2017-03-22 17:36:22 +0000 2017-03-22 18:18:35 +0000 8262 8262 True
ip-172-31-36-142 10 2017-03-22 17:16:24 +0000 2017-03-22 18:18:35 +0000 23053 23053 True
ip-172-31-36-142 11 2017-03-22 17:16:24 +0000 2017-03-22 18:18:35 +0000 12276 12276 True
ip-172-31-36-142 12 2017-03-22 17:16:24 +0000 2017-03-22 18:18:34 +0000 11395 11395 True
ip-172-31-36-142 13 2017-03-22 17:16:24 +0000 2017-03-22 18:18:36 +0000 14581 14581 True
ip-172-31-36-142 9 2017-03-22 17:16:24 +0000 2017-03-22 18:18:35 +0000 13491 13491 True
ip-172-31-53-164 10 2017-03-22 18:18:15 +0000 2017-03-22 18:18:35 +0000 71 71 True
ip-172-31-53-164 11 2017-03-22 18:18:15 +0000 2017-03-22 18:18:35 +0000 56 56 True
ip-172-31-53-164 12 2017-03-22 18:18:15 +0000 2017-03-22 18:18:35 +0000 58 58 True
ip-172-31-53-164 13 2017-03-22 18:18:15 +0000 2017-03-22 18:18:35 +0000 104 104 True
ip-172-31-53-164 9 2017-03-22 18:18:16 +0000 2017-03-22 18:18:35 +0000 67 67 True
ip-172-31-62-120 10 2017-03-22 17:16:05 +0000 2017-03-22 18:18:35 +0000 13490 13490 True
ip-172-31-62-120 11 2017-03-22 17:16:05 +0000 2017-03-22 18:18:35 +0000 23274 23274 True
ip-172-31-62-120 12 2017-03-22 17:16:06 +0000 2017-03-22 18:18:35 +0000 11442 11442 True
ip-172-31-62-120 13 2017-03-22 17:16:05 +0000 2017-03-22 18:18:35 +0000 14745 14745 True
ip-172-31-62-120 9 2017-03-22 17:16:05 +0000 2017-03-22 18:18:34 +0000 12376 12376 True
Received but not saved (0):
Saved but not received (0):
Miles pulled the logs from all the nodes and I ran my log parser and got the exact same results.
So that's 3 for 3!
Comment 8•8 years ago
On this load test:
https://app.datadoghq.com/dash/207347/antenna?live=false&page=0&is_auto=false&from_ts=1490215652508&to_ts=1490226552803&tile_size=m&fullscreen=false
we did a more complex load test that also involved deploys and changing of the autoscaling configuration. From Miles' email:
> We just finished executing one of the more wacky series of tests:
> - started a load test on deployment 146
> - scales to meet load
> - started deployment 147 midway through
> - cuts over from the 8 nodes that 146 has to the 2 nodes that 147 has at deploy time
> - scales to meet load
> - started deployment 148 midway through
> - cuts over from the 6 nodes that 147 has to the 2 nodes that 148 has at deploy time
> - scales to meet load
> - scales down to 2 nodes when traffic goes away (so we don’t get logs from the nodes except for those two)
Miles aggregated the logs from the nodes that weren't scaled down. I ran them through the log parser and got this:
(dockerize-bootstrap=e9cd5 antenna-catcher/ antenna-log-parser/ antennalib/ bin/ config/ diff_raw_crash.sh ...) ~/mozilla/socorro-zero/tmp/dataloss/20170322_2100> python ~/mozilla/socorro-zero/antenna-log-parser/log_parser.py "2017-03-22 21:00" "2017-03-22 23:59" */*
lines: 1227540
From 2017-03-22 21:00 to 2017-03-22 23:59
total crashes in: 613786
total crashes out: 613754
Hosts (75):
ip-172-31-14-118 10 2017-03-22 22:45:51 +0000 2017-03-22 22:46:05 +0000 30 30 True
ip-172-31-14-118 11 2017-03-22 22:45:51 +0000 2017-03-22 22:46:04 +0000 19 19 True
ip-172-31-14-118 12 2017-03-22 22:45:51 +0000 2017-03-22 22:46:04 +0000 28 28 True
ip-172-31-14-118 13 2017-03-22 22:45:52 +0000 2017-03-22 22:46:05 +0000 18 18 True
ip-172-31-14-118 9 2017-03-22 22:45:51 +0000 2017-03-22 22:46:05 +0000 48 48 True
ip-172-31-15-21 10 2017-03-22 21:23:15 +0000 2017-03-22 22:15:51 +0000 9126 9125 False
ip-172-31-15-21 11 2017-03-22 21:23:13 +0000 2017-03-22 22:15:51 +0000 9961 9960 False
ip-172-31-15-21 12 2017-03-22 21:23:12 +0000 2017-03-22 22:15:51 +0000 11070 11069 False
ip-172-31-15-21 13 2017-03-22 21:23:11 +0000 2017-03-22 22:15:51 +0000 19529 19529 True
ip-172-31-15-21 9 2017-03-22 21:23:11 +0000 2017-03-22 22:15:51 +0000 11961 11961 True
ip-172-31-19-237 10 2017-03-22 21:42:15 +0000 2017-03-22 22:15:51 +0000 5307 5307 True
ip-172-31-19-237 11 2017-03-22 21:42:15 +0000 2017-03-22 22:15:50 +0000 5920 5920 True
ip-172-31-19-237 12 2017-03-22 21:42:15 +0000 2017-03-22 22:15:51 +0000 7327 7325 False
ip-172-31-19-237 13 2017-03-22 21:42:15 +0000 2017-03-22 22:15:51 +0000 11533 11531 False
ip-172-31-19-237 9 2017-03-22 21:42:15 +0000 2017-03-22 22:15:51 +0000 6707 6707 True
ip-172-31-20-200 10 2017-03-22 22:25:01 +0000 2017-03-22 22:46:03 +0000 3613 3611 False
ip-172-31-20-200 11 2017-03-22 22:25:00 +0000 2017-03-22 22:46:04 +0000 7752 7750 False
ip-172-31-20-200 12 2017-03-22 22:25:00 +0000 2017-03-22 22:46:04 +0000 4510 4510 True
ip-172-31-20-200 13 2017-03-22 22:25:01 +0000 2017-03-22 22:46:04 +0000 4062 4062 True
ip-172-31-20-200 9 2017-03-22 22:25:00 +0000 2017-03-22 22:46:05 +0000 4897 4896 False
ip-172-31-22-121 10 2017-03-22 21:11:55 +0000 2017-03-22 22:15:51 +0000 24283 24282 False
ip-172-31-22-121 11 2017-03-22 21:11:55 +0000 2017-03-22 22:15:50 +0000 12229 12229 True
ip-172-31-22-121 12 2017-03-22 21:11:55 +0000 2017-03-22 22:15:51 +0000 15654 15653 False
ip-172-31-22-121 13 2017-03-22 21:11:55 +0000 2017-03-22 22:15:51 +0000 14635 14634 False
ip-172-31-22-121 9 2017-03-22 21:11:56 +0000 2017-03-22 22:15:51 +0000 13417 13416 False
ip-172-31-31-150 10 2017-03-22 23:10:03 +0000 2017-03-22 23:28:14 +0000 3740 3740 True
ip-172-31-31-150 11 2017-03-22 23:10:03 +0000 2017-03-22 23:28:13 +0000 3022 3022 True
ip-172-31-31-150 12 2017-03-22 23:10:04 +0000 2017-03-22 23:28:14 +0000 2708 2708 True
ip-172-31-31-150 13 2017-03-22 23:10:03 +0000 2017-03-22 23:28:14 +0000 6468 6468 True
ip-172-31-31-150 9 2017-03-22 23:10:03 +0000 2017-03-22 23:28:14 +0000 3339 3339 True
ip-172-31-35-13 11 2017-03-22 22:44:22 +0000 2017-03-22 23:28:14 +0000 16131 16131 True
ip-172-31-35-13 12 2017-03-22 22:44:22 +0000 2017-03-22 23:28:15 +0000 11536 11536 True
ip-172-31-35-13 13 2017-03-22 22:44:22 +0000 2017-03-22 23:28:14 +0000 12813 12813 True
ip-172-31-35-13 14 2017-03-22 22:44:22 +0000 2017-03-22 23:28:14 +0000 12055 12055 True
ip-172-31-35-13 15 2017-03-22 22:44:22 +0000 2017-03-22 23:28:14 +0000 10828 10828 True
ip-172-31-36-164 10 2017-03-22 22:15:20 +0000 2017-03-22 22:46:05 +0000 12959 12958 False
ip-172-31-36-164 11 2017-03-22 22:15:20 +0000 2017-03-22 22:46:04 +0000 9257 9256 False
ip-172-31-36-164 12 2017-03-22 22:15:21 +0000 2017-03-22 22:46:04 +0000 8868 8867 False
ip-172-31-36-164 13 2017-03-22 22:15:20 +0000 2017-03-22 22:46:04 +0000 8975 8975 True
ip-172-31-36-164 9 2017-03-22 22:15:20 +0000 2017-03-22 22:46:05 +0000 9890 9888 False
ip-172-31-37-224 10 2017-03-22 21:45:06 +0000 2017-03-22 22:15:51 +0000 6716 6715 False
ip-172-31-37-224 11 2017-03-22 21:45:07 +0000 2017-03-22 22:15:52 +0000 6026 6026 True
ip-172-31-37-224 12 2017-03-22 21:45:07 +0000 2017-03-22 22:15:57 +0000 4958 4958 True
ip-172-31-37-224 13 2017-03-22 21:45:06 +0000 2017-03-22 22:15:51 +0000 5514 5514 True
ip-172-31-37-224 9 2017-03-22 21:45:06 +0000 2017-03-22 22:15:56 +0000 11007 11007 True
ip-172-31-44-238 10 2017-03-22 21:25:06 +0000 2017-03-22 22:15:51 +0000 11463 11462 False
ip-172-31-44-238 11 2017-03-22 21:25:13 +0000 2017-03-22 22:15:51 +0000 8622 8622 True
ip-172-31-44-238 12 2017-03-22 21:25:14 +0000 2017-03-22 22:15:51 +0000 9512 9511 False
ip-172-31-44-238 13 2017-03-22 21:25:07 +0000 2017-03-22 22:15:51 +0000 10506 10506 True
ip-172-31-44-238 9 2017-03-22 21:25:06 +0000 2017-03-22 22:15:51 +0000 18738 18737 False
ip-172-31-5-206 10 2017-03-22 22:15:00 +0000 2017-03-22 22:15:51 +0000 154 154 True
ip-172-31-5-206 11 2017-03-22 22:15:00 +0000 2017-03-22 22:15:51 +0000 199 199 True
ip-172-31-5-206 12 2017-03-22 22:15:00 +0000 2017-03-22 22:15:51 +0000 322 322 True
ip-172-31-5-206 13 2017-03-22 22:15:00 +0000 2017-03-22 22:15:51 +0000 137 137 True
ip-172-31-5-206 9 2017-03-22 22:15:01 +0000 2017-03-22 22:15:51 +0000 170 170 True
ip-172-31-51-69 10 2017-03-22 22:25:04 +0000 2017-03-22 22:46:04 +0000 3630 3630 True
ip-172-31-51-69 11 2017-03-22 22:25:05 +0000 2017-03-22 22:46:04 +0000 4094 4094 True
ip-172-31-51-69 12 2017-03-22 22:25:05 +0000 2017-03-22 22:46:05 +0000 4882 4882 True
ip-172-31-51-69 13 2017-03-22 22:25:06 +0000 2017-03-22 22:46:04 +0000 4516 4515 False
ip-172-31-51-69 9 2017-03-22 22:25:05 +0000 2017-03-22 22:46:04 +0000 7819 7819 True
ip-172-31-53-130 10 2017-03-22 21:51:11 +0000 2017-03-22 22:15:51 +0000 4394 4394 True
ip-172-31-53-130 11 2017-03-22 21:51:12 +0000 2017-03-22 22:15:51 +0000 3946 3946 True
ip-172-31-53-130 12 2017-03-22 21:51:11 +0000 2017-03-22 22:15:51 +0000 5426 5426 True
ip-172-31-53-130 13 2017-03-22 21:51:12 +0000 2017-03-22 22:15:51 +0000 4864 4864 True
ip-172-31-53-130 9 2017-03-22 21:51:11 +0000 2017-03-22 22:15:51 +0000 8821 8821 True
ip-172-31-61-147 11 2017-03-22 21:11:55 +0000 2017-03-22 22:15:51 +0000 12251 12251 True
ip-172-31-61-147 12 2017-03-22 21:11:55 +0000 2017-03-22 22:15:53 +0000 23378 23378 True
ip-172-31-61-147 13 2017-03-22 21:11:55 +0000 2017-03-22 22:15:51 +0000 13212 13212 True
ip-172-31-61-147 14 2017-03-22 21:11:56 +0000 2017-03-22 22:15:51 +0000 14564 14564 True
ip-172-31-61-147 15 2017-03-22 21:11:55 +0000 2017-03-22 22:15:51 +0000 15657 15656 False
ip-172-31-8-109 11 2017-03-22 22:14:19 +0000 2017-03-22 22:46:04 +0000 9499 9499 True
ip-172-31-8-109 12 2017-03-22 22:14:19 +0000 2017-03-22 22:46:04 +0000 13131 13129 False
ip-172-31-8-109 13 2017-03-22 22:14:19 +0000 2017-03-22 22:46:05 +0000 9973 9973 True
ip-172-31-8-109 14 2017-03-22 22:14:19 +0000 2017-03-22 22:46:04 +0000 9160 9159 False
ip-172-31-8-109 15 2017-03-22 22:14:19 +0000 2017-03-22 22:46:04 +0000 10300 10298 False
Received but not saved (32):
CrashEvent(timestamp='2017-03-22 22:41:53 +0000', host='ip-172-31-8-109 12', crashid='42b5bf13-0a75-4442-b5b0-f23512170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:41:53 +0000', host='ip-172-31-36-164 11', crashid='8a541566-551d-4d80-9a56-eb9b32170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:52 +0000', host='ip-172-31-36-164 10', crashid='8fbeb7ae-3625-486d-9973-78c722170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:34 +0000', host='ip-172-31-44-238 10', crashid='e92d09c8-ec4c-475b-a259-114592170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:52 +0000', host='ip-172-31-36-164 12', crashid='c447bc97-4808-4a9f-9df3-dda582170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:41:59 +0000', host='ip-172-31-20-200 11', crashid='068ecad6-f392-4cd2-821e-518a82170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:52 +0000', host='ip-172-31-20-200 10', crashid='2c0446ac-617e-48c8-94fe-cbbce2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:11:34 +0000', host='ip-172-31-19-237 13', crashid='8342dd00-ed45-4a65-8461-4e8192170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:58 +0000', host='ip-172-31-8-109 15', crashid='5d0a5430-6f62-4cde-bd07-6d2632170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:11:35 +0000', host='ip-172-31-22-121 9', crashid='095f8855-ea57-4c63-8c53-760402170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:52 +0000', host='ip-172-31-20-200 10', crashid='b30c1e67-0b30-4cc6-8f2e-eed9f2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:33 +0000', host='ip-172-31-19-237 13', crashid='0f4b1ffe-b3b5-41d4-8ab0-b39fc2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:33 +0000', host='ip-172-31-22-121 12', crashid='db5173f7-ceab-4363-a176-435de2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:58 +0000', host='ip-172-31-20-200 11', crashid='2c3048f1-a52b-40ed-823e-bf0ef2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:34 +0000', host='ip-172-31-15-21 11', crashid='205b8e41-4682-4f00-9a33-36dee2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:11:34 +0000', host='ip-172-31-37-224 10', crashid='9bf56bd7-1a54-4525-adc1-109972170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:34 +0000', host='ip-172-31-44-238 9', crashid='9fcbe75e-95f9-444f-8345-564b52170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:58 +0000', host='ip-172-31-8-109 14', crashid='8c83fc8c-e982-451c-8921-692942170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:33 +0000', host='ip-172-31-22-121 13', crashid='226ef4ed-5da8-4632-8c49-353222170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:58 +0000', host='ip-172-31-51-69 13', crashid='202bfe28-e865-4bab-9556-36ce92170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:11:35 +0000', host='ip-172-31-19-237 12', crashid='149a5865-4d29-4e8b-a23c-fcc612170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:34 +0000', host='ip-172-31-15-21 12', crashid='79c51048-285c-48b2-8dfb-7c62e2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:34 +0000', host='ip-172-31-44-238 12', crashid='9ab9d6a1-f480-42ba-bb37-ebbdd2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:52 +0000', host='ip-172-31-36-164 9', crashid='bfd09628-ad32-40a5-9d95-d794e2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:33 +0000', host='ip-172-31-61-147 15', crashid='d3ceb971-1206-417b-a21e-00ffd2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:58 +0000', host='ip-172-31-36-164 9', crashid='d5ecaba3-33ea-4af2-b3ae-423f52170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:58 +0000', host='ip-172-31-20-200 9', crashid='134d1980-8d06-4cf9-8782-9a9762170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:33 +0000', host='ip-172-31-15-21 10', crashid='df34ec99-6fb2-489e-a5bc-e7ab72170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:34 +0000', host='ip-172-31-22-121 10', crashid='03d70adf-9a4f-4558-8d06-122ce2170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:42:52 +0000', host='ip-172-31-8-109 15', crashid='4aee850f-2d82-44ce-879e-7f1e62170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:41:59 +0000', host='ip-172-31-8-109 12', crashid='dcfc3168-401c-4de5-9cbf-d05d92170322', action='receive')
CrashEvent(timestamp='2017-03-22 22:12:33 +0000', host='ip-172-31-19-237 12', crashid='aff64830-bb0a-4dd6-a3b1-d277d2170322', action='receive')
Saved but not received (0):
Thus according to the logs on the nodes, we did incur some data loss--lost 32 crashes out of 613786. I'll talk more about that next.
I looked at the grephost. It had logs for the nodes that were scaled down. It said we lost 48 crashes out of 850244.
If I remove the hosts that were scaled down, then my log parser output for the grephost logs matched what we got from aggregating the logs from the nodes.
Some crashes were probably lost, but it seems good overall.
Comment 9•8 years ago
I generated a list of crash ids from the last load test and verified that we have crash data on s3 for every crash we expected to be there. So that's good.
Code for that script is in https://github.com/willkg/antenna-log-parser .
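The shape of that check can be sketched as follows. This is not the script from the repository: the existence test is passed in as a callable so the sketch stays independent of any particular S3 client or key layout, and the in-memory set stands in for the bucket purely for illustration.

```python
def find_missing(crash_ids, exists):
    """Return the crash ids for which exists(crash_id) is falsy, in input order."""
    return [crash_id for crash_id in crash_ids if not exists(crash_id)]


# Stand-in for the bucket: a set of ids we pretend were stored.
stored = {"crash-1", "crash-3"}
expected = ["crash-1", "crash-2", "crash-3"]
missing = find_missing(expected, lambda crash_id: crash_id in stored)
```

In a real run, `exists` would wrap whatever lookup the storage client provides for the crash's key.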
Comment 10•8 years ago
So where are we at now?
About Datadog:
The numbers in Datadog (# incoming, # saved) don't match what we're seeing from the logs, and when there's a lot of load on the system, the Datadog numbers are pretty far off. We shouldn't use Datadog counters for precision measurements. I think we can use them as indicators that something is horribly wrong, which is good for monitors. Making this better is either not possible or non-trivial and probably requires using a different service/system.
About the logs on the grephost:
At first, I was concerned the logs on the grephost weren't a good source of truth. However, I think these tests suggest that I can be confident in the grephost again. The one issue we had on 3-14-2017 seems to be unique. I don't know why it happened, but it hasn't happened since. I think we can stop pulling logs from nodes now.
About data loss:
Of the above analyzed load tests, only one suggested in the logs that it received crashes that were never saved. I verified that the crashes that the logs said were received but not saved didn't make it to s3, so the logs are correct. That load test had > 3x load and two deploys during the load test.
Of the 613786 crashes submitted, 32 didn't make it. That's a loss rate of about 0.005%, or roughly a 99.995% success rate.
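For reference, the ratio can be worked through directly from the counts reported in this bug (32 of 613786 from the node logs, 48 of 850244 from the grephost). The helper name is made up for the example.

```python
def success_percent(submitted, lost):
    """Percentage of submitted crashes that were saved."""
    return 100 * (1 - lost / submitted)


node_logs = success_percent(613786, 32)   # counts from the aggregated node logs
grephost = success_percent(850244, 48)    # counts from the grephost logs

print(f"node logs: {node_logs:.4f}%  grephost: {grephost:.4f}%")
```

Note that the fraction lost (32 / 613786, about 0.000052) has to be multiplied by 100 before it is read as a percentage.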
Antenna may lose crashes if we're scaling down. Will tweaked Antenna so that it prevents Gunicorn from shutting it down if there are crashes to be saved. Miles made autoscaling down a little slower.
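The general idea of draining pending work before exiting can be sketched like this. It is an illustration under stated assumptions, not Antenna's implementation: the class and method names are hypothetical, and how the drain is wired into Gunicorn's shutdown sequence is not shown.

```python
import time
from collections import deque


class PendingCrashes:
    """Queue of crashes waiting to be saved, with a drain step for shutdown."""

    def __init__(self, save):
        self.save = save        # callable that persists a single crash
        self.pending = deque()

    def add(self, crash):
        self.pending.append(crash)

    def drain(self, deadline_seconds=30.0):
        """Save queued crashes until empty or the deadline passes.

        Returns the number of crashes still unsaved, so the caller can log
        them instead of dropping them silently.
        """
        deadline = time.monotonic() + deadline_seconds
        while self.pending and time.monotonic() < deadline:
            self.save(self.pending.popleft())
        return len(self.pending)


saved = []
queue = PendingCrashes(saved.append)
queue.add("crash-1")
queue.add("crash-2")
unsaved = queue.drain()
```

A deadline is included because a process manager will typically force-kill a worker that takes too long to exit, so the drain can reduce loss but cannot rule it out.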
Antenna may lose crashes during a deploy. During "normal load" periods, this should be fine. We might lose crashes during "high load" periods. This is still a low chance--good deploys shouldn't lose crashes.
We have autoscaling set up such that Antenna only scales up in significant load situations and scales down after everything is cool again. This probably won't happen very often.
Having said all that, if you look at the load numbers Socorro currently has (1500 req/m during normal times and 3200 req/m when one of those betas goes out with the doorhangers that say "submit all your unsubmitted crashes now!"), the load tests suggest that Antenna can handle this load without scaling at all. We would need a *significant event* to cause Antenna to scale up at all.
I think we can say "Under normal operation, Antenna won't drop crashes."
I think we can say "Under normal peak load, Antenna won't drop crashes."
I think we can say "Under heavy load and atypical circumstances, Antenna might drop some crashes."
Given that, I think we're good here. Going to mark this as FIXED since the underlying question has been worked through.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 11•8 years ago
Switching Antenna bugs to Antenna component.
Component: General → Antenna