Closed Bug 1342606 (Opened 7 years ago; Closed 7 years ago)

queue size and active savers metrics are lies [antenna]

Categories

(Socorro :: Antenna, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(1 file)

The BreakpadSubmitterResource.save_queue_size and BreakpadSubmitterResource.active_save_workers metrics are generated and sent to statsd when a GET request to the /__heartbeat__ endpoint comes in. That GET request is triggered once a minute by cron on the node and handled by whichever worker process picks it up. Since Antenna is currently configured to run 5 worker processes per node, each heartbeat only reports data from one of those five processes, so the metrics may or may not be representative of what's going on across processes.
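For illustration, here's a minimal sketch of that pattern (class, attribute, and metric names here are assumptions, not Antenna's actual code): the gauges are emitted from the /__heartbeat__ GET handler, so they only ever describe the single worker process that happened to handle that request.

    # Hypothetical sketch, not Antenna's real implementation.
    from datadog import statsd  # assumed statsd client for illustration

    class HeartbeatResource:
        """Falcon-style GET handler for /__heartbeat__ (names are assumed)."""

        def __init__(self, breakpad_resource):
            self.breakpad_resource = breakpad_resource

        def on_get(self, req, resp):
            # These values come from *this* worker process only; the other
            # workers on the node never get a chance to report theirs.
            statsd.gauge('save_queue_size', len(self.breakpad_resource.save_queue))
            statsd.gauge('active_save_workers', self.breakpad_resource.active_save_workers)
            resp.body = '{}'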

This bug covers figuring out whether that nuance matters. If it doesn't, we should tweak the graph descriptions to note this. If it does, we should come up with a different way to gather those metrics.
Maybe have each worker process report the metric once a minute on its own timer instead of as part of /__heartbeat__ handling? 

Maybe start a greenlet that sleeps, sends data, sleeps, sends data, and so on?

http://www.gevent.org/gevent.html#gevent.sleep
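
A minimal sketch of the greenlet-on-a-timer idea (assumed names and metrics client, not the actual Antenna change): each worker process spawns a long-lived greenlet at startup that wakes up once a minute and emits its own gauges, independent of /__heartbeat__ traffic.

    # Hypothetical sketch of per-process periodic reporting, not Antenna's real code.
    import gevent
    from datadog import statsd  # assumed statsd client for illustration

    REPORT_INTERVAL = 60  # seconds

    def report_metrics_forever(breakpad_resource):
        """Emit this process's gauges once a minute, forever."""
        while True:
            statsd.gauge('save_queue_size', len(breakpad_resource.save_queue))
            statsd.gauge('active_save_workers', breakpad_resource.active_save_workers)
            # gevent.sleep yields to other greenlets; see the link above.
            gevent.sleep(REPORT_INTERVAL)

    # At worker startup, something like:
    #     gevent.spawn(report_metrics_forever, breakpad_resource)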
Grabbing this to look at on Monday.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
If you send the metrics per worker, we can sum them or view them by worker and node easily, so that's a solid option. Per-worker metrics would also help us figure out why some worker processes hang around eating memory after the load has passed.
I'm pretty sure this is correct now. I see data in Datadog, but it's all 0s since I'm not putting much load on -dev.

Miles: I'm not sure how to do the stat per worker. How do we do that?
If you pass a tag along with your statsd metric, something like the PID, we could group by that.
Is it interesting/helpful to add a PID tag to all the data from a worker?
If it's useful for you; that's what this all comes down to. At the very least, it would help visualize the lifecycle of workers in Datadog.
What does the distribution of PIDs look like over a time interval?
From IRC: lots of unique keys are bad, but PIDs are stable enough that they shouldn't be a problem.
Just to clarify, we're adding the PID as a tag--not to the key:

http://docs.datadoghq.com/guides/tagging/

So, maybe something like "pid:14"? It's a little weird since the PID alone isn't meaningful (we'll have a PID 14 on all the nodes), but host + PID should be unique. I'm pretty sure the metrics are already tagged by host courtesy of the Datadog client configuration Miles did.

Miles: Does that sound right? ^^^
That all sounds correct.
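
For reference, a minimal sketch of the tagging approach discussed above, using the Datadog statsd client (metric and tag names are illustrative): the pid tag, combined with the host tag the client configuration already adds, uniquely identifies a worker process.

    # Hypothetical sketch of tagging metrics with the worker's PID.
    import os

    from datadog import statsd  # assumed Datadog statsd client

    PID_TAG = 'pid:%d' % os.getpid()

    def report_gauges(queue_size, active_workers):
        # Grouping by host + pid in Datadog breaks the metric out per worker;
        # summing over the pid tag gives the per-node total.
        statsd.gauge('save_queue_size', queue_size, tags=[PID_TAG])
        statsd.gauge('active_save_workers', active_workers, tags=[PID_TAG])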
I haven't done the pid thing. It requires some tweaks to the infrastructure, and while it might let us see into specific processes, I think we can get by without it, and I'm not sure it helps us much in the future.

Given that, I'm going to mark this as FIXED now. If we need the pid tag, then we can write up a new bug and implement it then.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: General → Antenna