We're having problems with Antenna -dev where I'll post 10 crashes and one graph tells me "yay! 10 crashes were received!" and another graph will say "yay! 1 crash was saved!" or sometimes "yay! 0.25 crashes were saved!" We check the logs on the node and can clearly see all 10 crashes were saved, so it's a case of the graph being wrong.

The current theory is that the mystery is related to how statsd handles counter values that occur in the same second. Because the saves happen so fast and the value is always "1", they're getting whatevered and then we see weird data in the graph.

Related articles that possibly support that theory:

https://help.datadoghq.com/hc/en-us/articles/204271195-Why-is-a-counter-metric-being-displayed-as-a-decimal-value-
http://docs.datadoghq.com/guides/metrics/#counters

This bug covers changing the code to batch the counter data and send it as part of the heartbeat.
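A minimal sketch of the batching idea, for the record. The names here (`batch_incr`, the heartbeat flush, the fake client) are assumptions based on this thread, not the actual Antenna code: increments accumulate in memory and the summed totals get sent once per heartbeat instead of one packet per save.

```python
from collections import defaultdict


class BatchingMetrics:
    """Accumulate counter increments and flush them in one batch.

    Hypothetical sketch; the real Antenna code and its statsd client
    API may differ.
    """

    def __init__(self, statsd_client):
        self.statsd = statsd_client
        self.counters = defaultdict(int)

    def batch_incr(self, key, value=1):
        # Instead of emitting one metric per increment, accumulate
        # the value in memory.
        self.counters[key] += value

    def flush(self):
        # Called from the heartbeat: send each counter's total once,
        # then reset the accumulators.
        for key, value in self.counters.items():
            self.statsd.increment(key, value)
        self.counters.clear()


class FakeStatsd:
    """Stand-in client that records what would have been sent."""

    def __init__(self):
        self.sent = []

    def increment(self, key, value):
        self.sent.append((key, value))


# 10 fast saves collapse into a single batched send on flush.
metrics = BatchingMetrics(FakeStatsd())
for _ in range(10):
    metrics.batch_incr("save_crash.count")
metrics.flush()
```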
Created attachment 8842970 [details] Link to Github pull-request: https://github.com/mozilla/antenna/pull/173#attch-to-bugzilla
I landed the changes. I'll wait until they make it to -dev and then make sure they work.
Batching seems to be working. Looking at the Datadog graphs, the save_crash.count metric is definitely being rate-ified (whatever that means) by Datadog or the agent or something after Antenna sends the data. If we take the value after .as_count() and multiply it by 10 (which coincides with the heartbeat interval), then it lines up fine. I have no idea how to fix that so it's correct, but it seems like a problem on Datadog's side. Maybe we should try renaming the counter key again?
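For the record, here's the arithmetic behind the "multiply by 10" observation, assuming the agent normalizes counters to a per-second rate over its flush interval and that the interval happens to match our 10-second heartbeat (the specific numbers are illustrative, not measured):

```python
# Assumed numbers for illustration; the actual flush and heartbeat
# intervals come from the agent config and Antenna config.
heartbeat_interval = 10  # seconds between Antenna's batched sends
crashes_saved = 10       # counter total sent in one batch

# statsd-style agents commonly normalize counters to a per-second
# rate, which would be the value the graph displays:
per_second_rate = crashes_saved / heartbeat_interval

# Recovering the raw count means undoing that normalization, i.e.
# multiplying by the interval -- the "multiply by 10" trick above:
recovered_count = per_second_rate * heartbeat_interval
```

This would also explain the "0.25 crashes were saved" readings in comment 0: a small count divided by an interval produces a fractional rate.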
The graphs all look good. Batching is working fine. Given that, I'm going to mark this closed.
Reopening this. I've learned a lot more about statsd in the last month, enough to claim that the theories in this bug are horseshit. I incorrectly thought that "normalized by default at a per second rate" possibly meant that if we're reporting the same metric with the same value a lot in a second, some of the data gets dropped. That's wrong. Instead, what's going on is probably one of a few things:

1. In high load situations, some of the UDP packets are getting dropped.
2. Comparing incoming to saved is tricky because the counting happens at different times, so it's possible that the numbers end up off a bit.
3. When Antenna is scaling down or doing a deployment, we're pretty sure we can lose unsent measurement information (logs, statsd) when the node goes away.

Batching counters doesn't help in any of these situations. Given that, I'm removing the batch_incr code because it's probably not helping (despite the claims in comment #3) and thus it's just adding unneeded code complexity.
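The first failure mode is inherent to plain statsd transport: metrics go out as fire-and-forget UDP datagrams, so the sender never learns when a packet is dropped. A minimal sketch of what a single counter increment looks like on the wire (the host, port, and metric name here are assumptions):

```python
import socket


def statsd_incr(key, value=1, host="127.0.0.1", port=8125):
    """Send one statsd counter increment over UDP.

    UDP has no delivery guarantee: under high load the kernel or
    network can silently drop this datagram and sendto() still
    returns normally. That's failure mode 1 above.
    """
    # statsd counter wire format: "<key>:<value>|c"
    payload = f"{key}:{value}|c".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload  # returned only so the wire format is visible


wire_format = statsd_incr("save_crash.count")
```

Batching shrinks the packet count but a batched packet can still be dropped just as silently, which is why it doesn't actually fix this.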
Created attachment 8856263 [details] [review] Link to Github pull-request: https://github.com/mozilla/antenna/pull/202
I landed the removal of the batch_incr code. It deployed to -dev. I checked the dashboard and it's showing the same weird behavior as before.

Also, I spent some quality time in the datadogpy issue tracker to see if I could get a better understanding of things. This explains the "counts are off by ten" mystery: https://github.com/DataDog/dd-agent/issues/659#issuecomment-24875483

Given that, I'm going to call this WONTFIX because we really don't want to batch the data--that won't help.
Switching Antenna bugs to Antenna component.