Closed Bug 1343708 Opened 7 years ago Closed 7 years ago

change metrics in antenna to be batched [antenna]

Categories

(Socorro :: Antenna, task)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(2 files)

We're having problems with Antenna on -dev where I'll post 10 crashes and one graph tells me "yay! 10 crashes were received!" while another graph says "yay! 1 crash was saved!" or sometimes "yay! 0.25 crashes were saved!" We checked the logs on the node and could clearly see that all 10 crashes were saved, so it's a case of the graph being wrong.

The current theory is that the mystery is related to how statsd handles counter values that occur within the same second. Because the saves happen so fast and the value is always "1", the increments are getting mangled somewhere along the way, and then we see weird data in the graph.

Related articles that possibly support that theory:

https://help.datadoghq.com/hc/en-us/articles/204271195-Why-is-a-counter-metric-being-displayed-as-a-decimal-value-

http://docs.datadoghq.com/guides/metrics/#counters
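
To make the theory concrete, here's roughly what the wire traffic would look like. This is a minimal sketch, not Antenna's actual code: the metric name, host, and port are placeholder values, and the datagram uses the plain statsd counter format (metric:value|c).

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Ten saves landing in the same second produce ten identical "+1" datagrams;
# the theory was that the aggregator mishandles these somehow.
for _ in range(10):
    sock.sendto(b"save_crash.count:1|c", ("127.0.0.1", 8125))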


This bug covers changing the code to batch the counter data and send it as part of the heartbeat.
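
For reference, a rough sketch of what the batching could look like. batch_incr matches the name used later in this bug, but the class, the heartbeat wiring, and the datadogpy-style client with an increment() method are assumptions, not the actual patch:

import threading
from collections import Counter

from datadog import statsd  # assumes the datadogpy client; any statsd client with increment() works


class MetricsBatcher:
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def batch_incr(self, key, value=1):
        # Instead of emitting one UDP packet per crash, bump an in-memory counter.
        with self._lock:
            self._counts[key] += value

    def flush(self):
        # Called from the heartbeat: emit one increment per key with the
        # accumulated value, then reset.
        with self._lock:
            counts, self._counts = self._counts, Counter()
        for key, value in counts.items():
            statsd.increment(key, value)
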
Assignee: nobody → willkg
Status: NEW → ASSIGNED
I landed the changes.

I'll wait until they make it to -dev and then make sure they work.
Batching seems to be working.

Looking at the Datadog graphs, it's clear the save_crash.count metric is being converted to a per-second rate by Datadog or the agent somewhere after Antenna sends the data. If we take the value after .as_count() and multiply it by 10 (which coincides with the heartbeat interval), it lines up fine. I have no idea how to fix that so it's correct, but it seems like a problem on Datadog's side. Maybe we should try renaming the counter key again?
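
The arithmetic behind that observation, with assumed numbers (assuming the agent's flush interval is 10 seconds, matching our heartbeat interval): if the agent normalizes a counter to a per-second rate by dividing the raw count by the flush interval, then multiplying back by the interval recovers the count.

flush_interval = 10   # seconds; assumed flush/heartbeat interval
count = 10            # raw increments sent during one interval

rate = count / flush_interval       # 1.0: what the graph shows as "1 crash was saved!"
recovered = rate * flush_interval   # 10.0: multiplying by 10 lines it up
print(rate, recovered)

The same division (plus whatever averaging the graph applies across intervals) would also produce fractional readings like "0.25 crashes were saved!"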
The graphs all look good. Batching is working fine.

Given that, I'm going to mark this closed.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reopening this.

I've learned a lot more about statsd in the last month, enough to claim that the theories in this bug are horseshit.

I incorrectly thought that "normalized by default at a per second rate" possibly meant that if we report the same metric with the same value many times in a second, some of the data gets dropped. That's wrong.

Instead, what's going on is probably one of a few things:

1. in high load situations, some of the UDP packets are getting dropped

2. comparing incoming to saved is tricky because the counting happens at different times, so it's possible that the numbers end up off a bit

3. when Antenna is scaling down or doing a deployment, we're pretty sure unsent measurement information (logs, statsd) gets lost as the node goes away

Batching counters doesn't help in any of these situations.
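
To illustrate with the first case: under a hypothetical packet-loss rate, batching doesn't change how many counts you expect to lose, it just concentrates the loss, since one dropped heartbeat packet takes out a whole interval's worth of counts.

import random

random.seed(0)
drop_rate = 0.05  # assumed UDP loss under high load

# 1000 per-save packets vs. 10 batched packets carrying 100 counts each
unbatched = sum(1 for _ in range(1000) if random.random() > drop_rate)
batched = sum(100 for _ in range(10) if random.random() > drop_rate)

print(unbatched, batched)  # both lose ~5% of counts in expectation; the batched loss is lumpier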

Given that, I'm removing the batch_incr code because it's probably not helping (despite the claims in comment #3) and thus it's just adding unneeded code complexity.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I landed the removal of the batch_incr code. It deployed to -dev. I checked the dashboard and it's showing the same weird behavior as before.

Also, I spent some quality time in the datadogpy issue tracker to see if I could get a better understanding of things.

This explains the "counts are off by ten" mystery:

https://github.com/DataDog/dd-agent/issues/659#issuecomment-24875483

Given that, I'm going to call this WONTFIX because we really don't want to batch the data; that won't help.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Switching Antenna bugs to Antenna component.
Component: General → Antenna