Closed Bug 1059325 Opened 9 years ago Closed 4 years ago

[Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)

Categories

(Tree Management :: Treeherder: Infrastructure, task, P3)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: emorley, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: meta)

This bug is for the long-term fix for this problem: even if this particular instance clears up, getting this far behind is a tree-closing event, and so something that blocks switching to treeherder.
To clarify: looking at, say, mozilla-inbound on treeherder right now, I see jobs still marked as running as far back as https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=2ccb65865db7 - even though they appear completed on TBPL and buildapi says they finished ~10 hours ago.
This backlog in data ingestion is due to a problem we have had on production since the last code push yesterday. I saw an error in the routine that processes the incoming jobs, but it was not present on dev/stage. Everything got fixed by pushing the chief red button, without any code change. In my opinion, one way to mitigate this kind of problem would be to have a staging environment that matches the production architecture. Also, we could write a command to re-process the jobs that failed during ingestion in the last week or so.
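A rough sketch of what such a re-processing command could look like. Everything here is an assumption for illustration: `IngestionRecord`, `select_failed` and `reprocess` are hypothetical names, not Treeherder's actual schema or code, and `process` stands in for whatever routine really ingests a job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


@dataclass
class IngestionRecord:
    """Hypothetical record of one ingestion attempt (not Treeherder's real schema)."""
    job_guid: str
    received_at: datetime
    failed: bool


def select_failed(records: Iterable[IngestionRecord],
                  now: datetime,
                  window_days: int = 7) -> List[IngestionRecord]:
    """Return the records that failed ingestion within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    return [r for r in records if r.failed and r.received_at >= cutoff]


def reprocess(records: Iterable[IngestionRecord],
              process: Callable[[IngestionRecord], None],
              now: datetime,
              window_days: int = 7) -> List[str]:
    """Re-run `process` on each recently failed record.

    Returns the GUIDs of the jobs that failed again, so they can be reported.
    """
    still_failing = []
    for record in select_failed(records, now, window_days):
        try:
            process(record)
        except Exception:
            # Keep going: one bad job should not block the rest of the backlog.
            still_failing.append(record.job_guid)
    return still_failing
```

Returning the GUIDs that fail a second time keeps the command useful as a diagnostic as well as a recovery tool.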
Priority: P1 → P2
Summary: Data ingestion for multiple repos is many hours behind → Set up alerts for when data ingestion is failing and/or backlogged
Priority: P2 → P3
Now that we have access to newrelic, this isn't a regression over TBPL (we're already better off than having to read the import logs on tbpl-dev/cache/...).
Blocks: 1076750, 1074927
No longer blocks: treeherder-dev-transition
Guess we need to decide whether newrelic alerts are sufficient, or if we should get coverage via nagios too?
Blocks: 1075799
We should instrument the rabbitmq instance on the admin node to report to new relic.
In that way we would be able to see how many tasks we have in each queue, etc.
Maybe :fubar can help us?
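For illustration, a minimal sketch of the per-queue check described above. It assumes the RabbitMQ management plugin is enabled, so that `GET /api/queues` returns a JSON list of queue objects with `name` and `messages` fields; the fetch itself is left out, and the function names, queue names and threshold are hypothetical.

```python
import json
from typing import Dict, List


def queue_depths(payload: str) -> Dict[str, int]:
    """Map queue name to message count from a management API /api/queues response body."""
    queues = json.loads(payload)
    # "messages" can be missing when no stats are available yet; treat that as 0.
    return {q["name"]: int(q.get("messages", 0)) for q in queues}


def backlogged(depths: Dict[str, int], threshold: int) -> List[str]:
    """Return the names of queues holding more than `threshold` messages, sorted."""
    return sorted(name for name, depth in depths.items() if depth > threshold)


# Example with a hand-written payload (queue names are made up):
sample = '[{"name": "default", "messages": 3}, {"name": "log_parser", "messages": 1200}]'
print(backlogged(queue_depths(sample), threshold=1000))
```

The resulting list could be fed to whichever alerting system is chosen (New Relic, Nagios, etc.).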
Flags: needinfo?(klibby)
Ah great idea :-)
As it happens, we are/were missing monitoring pieces for treeherder in nagios. :-( I've added standard http/s checks for zeus and the webheads. Currently looking at what nagios can do for rabbitmq, etc.

newrelic is already set up on the admin node
Flags: needinfo?(klibby)
Making this bug more generic, since it sounds like we need to increase coverage for things other than data ingestion too.

(In reply to Mauro Doglio [:mdoglio] from comment #5)
> We should instrument the rabbitmq instance on the admin node to report to
> new relic.
> In that way we would be able to see how many tasks we have in each queue,
> etc.

Using something like this?
http://newrelic.com/plugins/pivotal/95
Priority: P3 → P2
Summary: Set up alerts for when data ingestion is failing and/or backlogged → Improve Nagios & New Relic coverage of treeherder
Blocks: 1072681
Depends on: 1076737
Blocks: 1080757
No longer blocks: 1072681
No longer blocks: 1080757
Component: Treeherder → Treeherder: Infrastructure
Depends on: 1093757
Keywords: meta
Summary: Improve Nagios & New Relic coverage of treeherder → [Meta] Improve Nagios & New Relic coverage of treeherder
Depends on: 1124278
Depends on: 1125395
Depends on: 1125569
Depends on: 1127774
Depends on: 1131130
Depends on: 1131171
Depends on: 1131240
Depends on: 1131244
Depends on: 1131247
Depends on: 1131394
Priority: P2 → P3
Depends on: 1141036
Depends on: 1141993
Depends on: 1076886
Depends on: 1165229
Depends on: 1191080
Depends on: 1200379
Depends on: 1201086
Depends on: 1223450
Depends on: 1223496
Depends on: 1225504
Depends on: 1276249
QA Contact: laura
Depends on: 1281850
Depends on: 1284289
Depends on: 1287950
Summary: [Meta] Improve Nagios & New Relic coverage of treeherder → [Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)
Depends on: 1306597
Depends on: 1176412
Depends on: 1201063
Depends on: 1307465
Depends on: 1308549
Depends on: 1336276
Depends on: 1340132, 1340123
Depends on: 1340203
Depends on: 1340216
Depends on: 1346204
Depends on: 1354484
Depends on: 1357538
Depends on: 1371264
Depends on: 1373245
Assignee: nobody → emorley
Depends on: 1387475
Depends on: 1387487
Depends on: 1387543
Depends on: 1387556
Depends on: 1387642
Depends on: 1393194
Depends on: 1397727
Depends on: 1413891
Assignee: emorley → nobody
QA Contact: laura
I've just updated the Heroku metrics alerts (see https://devcenter.heroku.com/articles/metrics#threshold-alerting) to go to treeherder-internal@ rather than only me.
Depends on: 1439368
Depends on: 1463709
Depends on: 1483301
Depends on: 1503576
No longer depends on: 1463709
Depends on: 1513506
Type: defect → task

The two bugs are in the component's queue, which is not massive.
If we ever plan to tackle them, they're already filed. There is no need for a meta bug for only two bugs.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID