add statsd collection for crash report size in collector

RESOLVED FIXED

Status

Socorro
General
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: willkg, Assigned: willkg)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

The collector WSGI app handles incoming HTTP POST requests which have the crash report as the payload. We know that the size of crash reports has been growing over time. We don't know what sizes we're seeing now and we have no way to look at it historically.

We probably want to get the size for incoming compressed crash reports as well as the reports uncompressed. Both of those numbers are interesting plus they'll also tell us how many crash reports are coming in compressed.

This issue covers adding statsd collection to the collector for incoming crash reports.
Lonnen and I were talking about this and originally thought we should do statsd gauge calls. Lonnen pointed out that Datadog has an extension to statsd that does histograms. That's probably more like what we want.

http://docs.datadoghq.com/guides/dogstatsd/#histograms

If it turns out to be total crap, we can switch from histograms to gauges--no big deal.

Grabbing this to work on now since I need this information for my write up of how the collector works and requirements.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Gauges are the wrong thing, so instead of doing that, I'm going to (ab)use timings.

I'm mostly done with this. I have to write tests, verify it all works correctly and address some FIXMEs.
In a PR: https://github.com/mozilla/socorro/pull/3439

I also tweaked it so that it captures data about rejected crash reports and differentiates between accepted and rejected data. This will be important if we ever decide to move the throttler out of the collector and into a later portion of the pipeline.
When this lands, it won't have any effect on the system. To enable metrics gathering, we need to do this config change:

consulate kv set socorro/collector/metrics.metrics_class socorro.external.statsd.metrics.StatsdMetrics

It'll start sending data to datadog with the following keys:

crash_report_size_(approved|rejected)_(compressed|uncompressed)

Then we can build a dashboard for it in datadog.
Oops--forgot a key.

Redoing the key from comment #4 so they're in the same comment. Need to add both these keys:

consulate kv set socorro/collector/metrics.metrics_class socorro.external.statsd.metrics.StatsdMetrics
consulate kv set socorro/collector/metrics.statsd_prefix collector.raw_crash

Comment 6

a year ago
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/27f595727a0f10ab5d5c2f460c60b5249d2e7987
Bug 1293776 initial pass on StatsdMetrics (#3439)

* Bug 1293776 initial pass on StatsdMetrics

This creates a new kind of external component called Metrics. It gets
used in the breakpad collector after saving the crash to log data
about the incoming crash at the point we've collected it.

* Bug 1293776 adjust crash size report metrics

This adjusts the crash size report metrics such that they also capture
data about rejected crash reports. Further, it tweaks the data key so
it's easier to intuit what it is in datadog and other similar systems.

* Bug 1293776 Fix integration test

I think this fixes the integration test that Travis runs using the
Collector2015App that we don't use. That app re-arranges where
configuration is located in the collector app hierarchy of components,
so "config.metrics" gets handled differently.

* Bug 1293776 address review comments

* fix docstring
* move the rest of the metrics code into the try/except block
* reduce the work the except block does so as to reduce the likelihood of errors
  in the error handling
(In reply to Will Kahn-Greene [:willkg] from comment #5)
> Oops--forgot a key.
> 
> Redoing the key from comment #4 so they're in the same comment. Need to add
> both these keys:
> 
> consulate kv set socorro/collector/metrics.metrics_class
> socorro.external.statsd.metrics.StatsdMetrics
> consulate kv set socorro/collector/metrics.statsd_prefix collector.raw_crash

r+ to run this on stage.
We merged the code, which triggered a deploy to the -stage environment. I went in and made the consulate configuration changes and waited 15 minutes for the collector uwsgi processes to pick up the new configuration. Then I waited some more time for datadog to start getting data, but no data came as far as I could tell.

I checked the logs on one of the stage collector nodes and saw no errors or anything to indicate anything was awry. I verified the metrics code is on the -stage collector nodes.

Then Peter and I spent some time debugging things. We verified that the configuration keys were correct and that they should have changed the behavior. We verified that the keys made it into consulate. Then we noticed that the uwsgi processes didn't seem to have picked up the configuration.

JP says that sometimes stuff doesn't restart in -stage after consulate changes. His method is to go into AWS and terminate all the nodes which causes AWS to create new nodes. Alternatively, he said we should tie configuration changes to deploys.

So then I merged another PR to trigger a deploy on -stage. I'll wait an hour for that to work itself out. The current theory is that the uwsgi processes never restarted after the configuration changes, so they didn't take effect. A deploy will force that to happen.

That's everything so far.

Things to think about:

1. if it's the case that the collector uwsgi processes don't restart after consulate changes, then we should change the consulate keys before doing a -prod deploy

2. if I can't get this working today, I should either abandon ship (revert the changes) or attempt to get it working on -stage tomorrow, not make any configuration changes to -prod, and use the data gathered from -stage which won't be great, but might be a good enough sampling for accepted crashes for the requirements doc
Turns out -stage was configured to use Collector2015App instead of CollectorApp. I wrote up bug #1297795 to fix that, then fixed it.

After doing the consulate changes, the collector uwsgi processes didn't restart again. I don't think they restart on consulate changes--I have yet to see it happen. Am I super unlucky? Is there something amiss? Is this a problem with -stage and -prod? Is it intended? Only the shadow knows.

Regardless, I can see the data in datadog now. I threw it into this dashboard:

https://app.datadoghq.com/dash/174627/willkg-dashboard-of-meaningful-things

The prefix is wrong. I'll look into that tomorrow.
Two minor updates:

1. The collector web processes *never* restart on consul changes and that's intentional. After making configuration changes that affect the collector web process, we'll have to do a deploy or restart the services by hand on the nodes.

2. I didn't look into the prefix issue, thus this is only configured in -stage still. I don't think I'm going to get to it immediately since the data on -stage seems good enough for my immediate needs.
I figured out why we're not getting prefixes: the dogstatsd client is created using the defaults (localhost and 8125) and doesn't include a key prefix.

I'm not sure how to fix that without disrupting all our data collection, so instead I think I'll just change the keys.
Created attachment 8806566 [details] [review]
Link to Github pull-request: https://github.com/mozilla/socorro/pull/3567

Comment 13

a year ago
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/f1baf95c88ce64f02f49eea8bc7096da54e19c77
Bug 1293776: Fix prefix (#3567)

The dogstatsd client uses the defaults (localhost and 8125 and no key
prefix), so that's why it's ignoring the key prefix we assigned.

I could fix that code, but that would affect all the metrics. Instead,
I'm manually adding the prefix.
I'll verify this on -stage tomorrow, update the datadog dashboards I've got and then when it gets pushed to -prod, I'll enable it there and redo the datadog dashboards again.
Have you verified this is working as expected in stage? (going to prod push later today, so need to know)
Flags: needinfo?(willkg)
I just checked datadog and yes--it's working as expected. We're seeing results coming in for collector.crash_report_size_* rather than just crash_report_size_* like we were previously.

bd
Flags: needinfo?(willkg)
I applied the configuration changes in comment #5 to the -prod environment. The collector web process doesn't restart on consul changes, so we won't see any data until someone does a -prod push or otherwise restarts the web processes. I'll wait for that to happen, but it's probably next week.

I started a Datadog dashboard for these stats. I'll finish that next week when we start getting data.
Someone did a -prod push, so data is getting generated now.

I tossed together a dashboard!

https://app.datadoghq.com/screen/134445/crashreport-sizes--prod

Marking as FIXED.
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.