Closed Bug 1377313 Opened 7 years ago Closed 7 years ago

precipitous drop in ELB Request Count in antenna dashboard

Categories

(Socorro :: Infra, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: miles)

Details

If you look at the ELB Request graph in the Antenna -prod dashboard, there's a precipitous drop in the ELB Request rate between May 25th and 26th.

All the other graphs look fine--there are no drops. It sure seems like the ELB is under-reporting or something like that.

Datadog dashboard range:

https://app.datadoghq.com/dash/274773/antenna--prod?live=false&page=0&is_auto=false&from_ts=1495256667377&to_ts=1496445531057&tile_size=m&fullscreen=false&tpl_var_type=*

Since the other counts look ok (for example, the rate of incoming crashes), this is a mystery and not a four-alarm fire.

This bug covers looking into it and fixing the graph accordingly.
That's near a deploy. Miles thinks that we switched ELB names and the new ELB isn't reporting in a way that's represented in that graph.

I'm assigning to Miles for further investigation.
Assignee: nobody → miles
Status: NEW → ASSIGNED
Miles and I did some more investigation, and I learned a lot more about which metrics are interesting in Datadog. The precipitous drop in ELB Request rate lines up *exactly* with a corresponding precipitous drop in HTTP 4xx responses.

Something was hitting URLs like / (which returns a 404) at a persistent 60 req/s. That traffic did not ebb and flow with our normal day/night cycle, and it had been going on for longer than we have Datadog data for. And then it stopped.
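
For anyone who wants to sanity-check that kind of correlation outside the dashboard, here is a minimal sketch. It assumes you have already exported req/s samples for total requests and for 4xx responses from before and after the drop. The function name, the tolerance, and all sample values are made up for illustration (only the roughly 60 req/s of 4xx comes from this bug); this is not real Datadog data and it does not call the Datadog API.

```python
from statistics import mean


def drop_explained_by_4xx(total_before, total_after, err_before, err_after,
                          tolerance=0.1):
    """Return True if the drop in mean total req/s is matched by the drop
    in mean 4xx req/s, within `tolerance` (a fraction of the total drop).

    Hypothetical helper: the name and the 10% default are assumptions.
    """
    total_drop = mean(total_before) - mean(total_after)
    err_drop = mean(err_before) - mean(err_after)
    if total_drop <= 0:
        # No drop in total requests, so there is nothing to explain.
        return False
    return abs(total_drop - err_drop) <= tolerance * total_drop


# Made-up req/s samples for illustration only.
total_before = [78.0, 82.0, 80.0, 79.0]
total_after = [19.0, 22.0, 20.0, 21.0]
err_before = [60.0, 60.0, 60.0, 60.0]  # flat 4xx traffic, no day/night cycle
err_after = [0.5, 0.4, 0.6, 0.5]

explained = drop_explained_by_4xx(total_before, total_after,
                                  err_before, err_after)
print(explained)
```

If the two drops are about the same size, the ELB is probably reporting fine and the missing requests were the 4xx traffic. If the total drops but 4xx stays flat, something else is going on.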

We added a graph for all the HTTP error codes to the dashboard so we'll be able to more easily see fluctuations in 4xx.

We don't have a clue what was hitting Socorro like that. Maybe bots? Maybe scans? Maybe bored ancient ones mucking around in the deep abyss?

We're pretty sure our monitoring is correct, though. Given that, I think we can close this out since there really isn't anything else we can do now that's impactful.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
One thing I could do is re-align the 1x line to what it should be now that we're not getting 60 req/s of 4xx errors.

Anyone dislike that idea?
+1
I changed the 1x line from 80 req/s to 25 req/s. That seems to be just above where our load is normally these days.
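
As a back-of-the-envelope check on that number, here is a small sketch. The 80 req/s and 60 req/s figures are from this bug; the helper name and the 5 req/s of headroom are assumptions for illustration, and this is not necessarily how the new value was actually picked.

```python
def realigned_reference_line(old_line, removed_rate, headroom):
    """Reference line (req/s) after a constant background rate goes away,
    with some headroom added on top.

    Hypothetical helper for illustration only.
    """
    if removed_rate > old_line:
        raise ValueError("removed_rate cannot exceed old_line")
    return old_line - removed_rate + headroom


# 80 req/s old line, minus the ~60 req/s of 4xx that stopped,
# plus an assumed 5 req/s of headroom.
new_line = realigned_reference_line(old_line=80, removed_rate=60, headroom=5)
print(new_line)
```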