Closed Bug 1342606 (Opened 7 years ago; Closed 7 years ago)

queue size and active savers metrics are lies [antenna]

Categories

(Socorro :: Antenna, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(1 file)

The BreakpadSubmitterResource.save_queue_size and BreakpadSubmitterResource.active_save_workers metrics are generated and sent to statsd when a GET request to the /__heartbeat__ endpoint comes in. That GET request is triggered once a minute by cron on the node and handled by whichever worker process picks it up. Since Antenna is currently configured to run 5 worker processes per node, each heartbeat only reports data from one of those five processes, so the metrics may or may not be representative of what's going on across processes.
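For illustration, here's a minimal sketch of that pattern (class, attribute, and metric names here are assumptions, not Antenna's actual code): the gauges are emitted from the /__heartbeat__ GET handler, so they only ever describe the single worker process that happened to handle that request.

    # Hypothetical sketch, not Antenna's real implementation.
    from datadog import statsd  # assumed statsd client for illustration

    class HeartbeatResource:
        """Falcon-style GET handler for /__heartbeat__ (names are assumed)."""

        def __init__(self, breakpad_resource):
            self.breakpad_resource = breakpad_resource

        def on_get(self, req, resp):
            # These values come from *this* worker process only; the other
            # workers on the node never get a chance to report theirs.
            statsd.gauge('save_queue_size', len(self.breakpad_resource.save_queue))
            statsd.gauge('active_save_workers', self.breakpad_resource.active_save_workers)
            resp.body = '{}'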

This bug covers figuring out whether that nuance matters. If it doesn't, we should tweak the graph descriptions to note this. If it does, we should come up with a different way to gather those metrics.
Maybe have each worker process report the metric once a minute on its own timer instead of as part of /__heartbeat__ handling? 

Maybe start a greenlet that sleeps, sends data, sleeps, sends data, and so on?

http://www.gevent.org/gevent.html#gevent.sleep
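
A minimal sketch of the greenlet-on-a-timer idea (assumed names and metrics client, not the actual Antenna change): each worker process spawns a long-lived greenlet at startup that wakes up once a minute and emits its own gauges, independent of /__heartbeat__ traffic.

    # Hypothetical sketch of per-process periodic reporting, not Antenna's real code.
    import gevent
    from datadog import statsd  # assumed statsd client for illustration

    REPORT_INTERVAL = 60  # seconds

    def report_metrics_forever(breakpad_resource):
        """Emit this process's gauges once a minute, forever."""
        while True:
            statsd.gauge('save_queue_size', len(breakpad_resource.save_queue))
            statsd.gauge('active_save_workers', breakpad_resource.active_save_workers)
            # gevent.sleep yields to other greenlets; see the link above.
            gevent.sleep(REPORT_INTERVAL)

    # At worker startup, something like:
    #     gevent.spawn(report_metrics_forever, breakpad_resource)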
Grabbing this to look at on Monday.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
If you send the metrics per worker, we can sum them or view them by worker and node easily, so that's a solid option. Per-worker metrics would also help us figure out why some worker processes hang around eating memory after the load has passed.
I'm pretty sure this is correct now. I see data in Datadog, but it's all 0s since I'm not putting much load on -dev.

Miles: I'm not sure how to do the stat per worker. How do we do that?
If you pass a tag along with your statsd metric, something like the PID, we could group by that.
Is it interesting/helpful to add a PID tag to all the data from a worker?
If it's useful for you; that's what this all comes down to. At the very least, it would help visualize the lifecycle of workers in Datadog.
What does the distribution of PIDs look like over a time interval?
From IRC: lots of unique keys are bad, but PIDs are stable enough that they shouldn't be a problem.
Just to clarify, we're adding the PID as a tag--not to the key:

http://docs.datadoghq.com/guides/tagging/

So, maybe something like "pid:14"? It's a little weird since the PID alone isn't meaningful (we'll have a PID 14 on all the nodes), but host + PID should be unique. I'm pretty sure the metrics are already tagged by host courtesy of the Datadog client configuration Miles did.

Miles: Does that sound right? ^^^
That all sounds correct.
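
For reference, a minimal sketch of the tagging approach discussed above, using the Datadog statsd client (metric and tag names are illustrative): the pid tag, combined with the host tag the client configuration already adds, uniquely identifies a worker process.

    # Hypothetical sketch of tagging metrics with the worker's PID.
    import os

    from datadog import statsd  # assumed Datadog statsd client

    PID_TAG = 'pid:%d' % os.getpid()

    def report_gauges(queue_size, active_workers):
        # Grouping by host + pid in Datadog breaks the metric out per worker;
        # summing over the pid tag gives the per-node total.
        statsd.gauge('save_queue_size', queue_size, tags=[PID_TAG])
        statsd.gauge('active_save_workers', active_workers, tags=[PID_TAG])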
I haven't done the pid thing. It requires some tweaks to the infrastructure, and while it might let us see into specific processes, I think we can get by without it, and I'm not sure it helps us much in the future.

Given that, I'm going to mark this as FIXED now. If we need the pid tag, then we can write up a new bug and implement it then.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: General → Antenna