Closed Bug 1343708 Opened 7 years ago Closed 7 years ago

change metrics in antenna to be batched [antenna]

Categories

(Socorro :: Antenna, task)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(2 files)

We're having problems with Antenna on -dev where I'll post 10 crashes and one graph tells me "yay! 10 crashes were received!" while another graph says "yay! 1 crash was saved!" or sometimes "yay! 0.25 crashes were saved!" We checked the logs on the node and could clearly see that all 10 crashes were saved, so it's a case of the graph being wrong.

The current theory is that the mystery is related to how statsd handles counter values that occur within the same second. Because the saves happen so fast and the value is always "1", the increments are getting mangled somewhere along the way, and then we see weird data in the graph.

Related articles that possibly support that theory:

https://help.datadoghq.com/hc/en-us/articles/204271195-Why-is-a-counter-metric-being-displayed-as-a-decimal-value-

http://docs.datadoghq.com/guides/metrics/#counters
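
To make the theory concrete, here's roughly what the wire traffic would look like. This is a minimal sketch, not Antenna's actual code: the metric name, host, and port are placeholder values, and the datagram uses the plain statsd counter format (metric:value|c).

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Ten saves landing in the same second produce ten identical "+1" datagrams;
# the theory was that the aggregator mishandles these somehow.
for _ in range(10):
    sock.sendto(b"save_crash.count:1|c", ("127.0.0.1", 8125))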


This bug covers changing the code to batch the counter data and send it as part of the heartbeat.
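
For reference, a rough sketch of what the batching could look like. batch_incr matches the name used later in this bug, but the class, the heartbeat wiring, and the datadogpy-style client with an increment() method are assumptions, not the actual patch:

import threading
from collections import Counter

from datadog import statsd  # assumes the datadogpy client; any statsd client with increment() works


class MetricsBatcher:
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def batch_incr(self, key, value=1):
        # Instead of emitting one UDP packet per crash, bump an in-memory counter.
        with self._lock:
            self._counts[key] += value

    def flush(self):
        # Called from the heartbeat: emit one increment per key with the
        # accumulated value, then reset.
        with self._lock:
            counts, self._counts = self._counts, Counter()
        for key, value in counts.items():
            statsd.increment(key, value)
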
Assignee: nobody → willkg
Status: NEW → ASSIGNED
I landed the changes.

I'll wait until they make it to -dev and then make sure they work.
Batching seems to be working.

Looking at the Datadog graphs, it's clear the save_crash.count metric is being converted to a per-second rate by Datadog or the agent somewhere after Antenna sends the data. If we take the value after .as_count() and multiply it by 10 (which coincides with the heartbeat interval), it lines up fine. I have no idea how to fix that so it's correct, but it seems like a problem on Datadog's side. Maybe we should try renaming the counter key again?
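
The arithmetic behind that observation, with assumed numbers (assuming the agent's flush interval is 10 seconds, matching our heartbeat interval): if the agent normalizes a counter to a per-second rate by dividing the raw count by the flush interval, then multiplying back by the interval recovers the count.

flush_interval = 10   # seconds; assumed flush/heartbeat interval
count = 10            # raw increments sent during one interval

rate = count / flush_interval       # 1.0: what the graph shows as "1 crash was saved!"
recovered = rate * flush_interval   # 10.0: multiplying by 10 lines it up
print(rate, recovered)

The same division (plus whatever averaging the graph applies across intervals) would also produce fractional readings like "0.25 crashes were saved!"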
The graphs all look good. Batching is working fine.

Given that, I'm going to mark this closed.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reopening this.

I've learned a lot more about statsd in the last month, enough to claim that the theories in this bug are horseshit.

I incorrectly thought that "normalized by default at a per second rate" possibly meant that if we report the same metric with the same value many times in a second, some of the data gets dropped. That's wrong.

Instead, what's going on is probably one of a few things:

1. in high load situations, some of the UDP packets are getting dropped

2. comparing incoming to saved is tricky because the counting happens at different times, so it's possible that the numbers end up off a bit

3. when Antenna is scaling down or doing a deployment, we're pretty sure unsent measurement information (logs, statsd) gets lost as the node goes away

Batching counters doesn't help in any of these situations.
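
To illustrate with the first case: under a hypothetical packet-loss rate, batching doesn't change how many counts you expect to lose, it just concentrates the loss, since one dropped heartbeat packet takes out a whole interval's worth of counts.

import random

random.seed(0)
drop_rate = 0.05  # assumed UDP loss under high load

# 1000 per-save packets vs. 10 batched packets carrying 100 counts each
unbatched = sum(1 for _ in range(1000) if random.random() > drop_rate)
batched = sum(100 for _ in range(10) if random.random() > drop_rate)

print(unbatched, batched)  # both lose ~5% of counts in expectation; the batched loss is lumpier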

Given that, I'm removing the batch_incr code because it's probably not helping (despite the claims in comment #3) and thus it's just adding unneeded code complexity.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I landed the removal of the batch_incr code. It deployed to -dev. I checked the dashboard and it's showing the same weird behavior as before.

Also, I spent some quality time in the datadogpy issue tracker to see if I could get a better understanding of things.

This explains the "counts are off by ten" mystery:

https://github.com/DataDog/dd-agent/issues/659#issuecomment-24875483

Given that, I'm going to call this WONTFIX because we really don't want to batch the data; that won't help.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Switching Antenna bugs to Antenna component.
Component: General → Antenna