Closed Bug 1419550 Opened 7 years ago Closed 6 years ago

[ops infra socorro] metrics

Categories

(Socorro :: Infra, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: osmose)

We need the ops socorro -stage-new environment hooked up with Datadog and a Datadog dashboard for the -stage-new environment.

Further, Socorro developers need access to this dashboard.

This bug covers setting that up.
The metrics configuration is in place, but I'm not seeing the processor metrics I expect to be seeing.

We have `resource.statsd.statsd_host=localhost` in the environment, and looking at the configuration dump from the processor in the logs, things appear to be in order. The Datadog agent is listening on the correct port, and host-related metrics from the processor hosts are showing up, but custom metrics such as `processor.save_raw_and_processed` are not.
We need us some metrics, so making this a P1.

I'm grabbing this to look into, but I'm not sure I have many ideas. We'll see.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P1
I sent about 100 crashes to the -stage-new collector to get some recent data to use while looking into the metrics problem.

Here's the dashboard for the collector in -stage-new:

https://app.datadoghq.com/dash/400900/socorro-collector-new-stage?live=false&page=0&is_auto=false&from_ts=1511811913653&to_ts=1511813382000&tile_size=m

One thing I notice immediately is that the only data there is from pigeon and the node hosts--there's nothing from the Antenna container.

This is the same thing we're seeing with Socorro:

https://app.datadoghq.com/dash/405076/socorro-new-stage?live=false&page=0&is_auto=false&from_ts=1511811924705&to_ts=1511813391789&tile_size=m

We see data coming from the host, but nothing coming from the Socorro apps. (It's harder to tell here because there's *no* data in that org for those keys yet, whereas with the collector dashboard we have an Antenna -stage and an Antenna -prod, so the keys exist.)

It really feels to me like data is getting dropped between the container and the host. Filters? Firewalls? Closed ports? Something like that?

However, last Tuesday, Miles ran some curl/netcat commands and sent data manually from inside the container to the Datadog agent on the host which sent it to Datadog. We could see that data in the Datadog dashboards.
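
For reference, here is a rough Python stand-in for that kind of manual send. The exact commands Miles ran aren't recorded here, so this is only a sketch under assumptions: the agent is assumed to listen on the default DogStatsD UDP port (8125), and the metric name and the `format_counter` / `send_counter` helpers are made up for illustration.

```python
import socket

# Assumption: the Datadog agent uses the default DogStatsD UDP port.
DOGSTATSD_PORT = 8125


def format_counter(name, value=1):
    """Build a StatsD counter datagram of the form b"<name>:<value>|c"."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name, value=1, host="localhost", port=DOGSTATSD_PORT):
    """Send one counter increment over UDP and return the bytes sent.

    UDP is fire-and-forget: the send succeeds even when nothing is
    listening, so a clean return does not prove the agent got the metric.
    """
    payload = format_counter(name, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `send_counter("socorro.test.manual_send")` from inside the container and then looking for that (hypothetical) key in Datadog would check the container-to-agent path without involving any Socorro code.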

Currently, I'm really puzzled. I have no access to the nodes, so I can only observe the lack of things showing up where they should.

I have a few hand-wavey directions we can go from here:

1. document and re-run the test datadog data send from inside the container
2. write a socorro app that sends a test data incr and uses the environment variables and existing code
3. rework things with markus and start with that

I'd need access to the container for direction 1 and I don't have that. I'm going to defer this a bit and work on other things, but probably go with direction 2 because that seems prudent and less involved than direction 3 (though I want to do that soon anyhow).
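
A minimal sketch of what direction 2 could look like, using only the standard library rather than Socorro's existing metrics code. Assumptions: `resource.statsd.statsd_host` is the key from the environment mentioned earlier; the `resource.statsd.statsd_port` key, the 8125 default, the `socorro.test.incr` metric name, and both function names are hypothetical.

```python
import os
import socket


def load_statsd_config(env=None):
    """Read the statsd host and port from the environment.

    The host key comes from this bug; the port key and both defaults
    are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("resource.statsd.statsd_host", "localhost")
    port = int(env.get("resource.statsd.statsd_port", "8125"))
    return host, port


def send_test_incr(key="socorro.test.incr", env=None):
    """Send a single counter increment to the configured statsd host.

    Returns (host, port, payload) so the caller can log exactly where
    the datagram went and what it contained.
    """
    host, port = load_statsd_config(env)
    payload = f"{key}:1|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return host, port, payload
```

Logging the returned host and port would at least confirm that the app resolves the same settings the configuration dump shows, independent of whether any crashes get processed.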
Mike: Tagging you with this.
Assignee: willkg → mkelly
Mike and Miles went through a bunch of things earlier today all of which suggested we shouldn't be seeing any problems. Later, Will and Miles went through those things again--again, everything suggested we shouldn't be seeing problems.

Turns out the problem was that the database was missing key data, so the processor wasn't actually successfully processing anything, so it wasn't generating any metrics at all. Once Miles and I fixed the database and ran crontabber jobs that flesh out some of the other tables, then we could process crashes successfully and metrics were generated.

We're all good here now! Yay!
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED