Closed Bug 1059325 Opened 9 years ago Closed 4 years ago
[Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)
This bug tracks the long-term fix for this problem: even if this particular instance clears up, getting this far behind is a tree-closing event, and so something that blocks switching to treeherder.
To clarify: looking at, say, mozilla-inbound on treeherder right now, I see jobs still marked as running as far back as https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=2ccb65865db7, even though they appear completed on TBPL and buildapi says they finished ~10 hours ago.
This backlog in data ingestion is due to a problem we have had on production since yesterday's code push. I saw an error in the routine that processes incoming jobs, but it was not present on dev/stage. Everything was fixed by pushing the chief red button, without any code change. In my opinion, one way to mitigate this kind of problem would be to have a staging environment that matches the production architecture. We could also write a command to re-process the jobs that failed during ingestion in the last week or so.
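For illustration only, the selection step of such a re-processing command could look roughly like the sketch below. The record layout (`job_id`, `status`, `attempted_at`) and the function name are hypothetical and do not reflect treeherder's actual schema; a real command would read these records from wherever ingestion failures end up being logged.

```python
from datetime import datetime, timedelta, timezone


def jobs_to_reprocess(records, now, window=timedelta(days=7)):
    """Return the IDs of jobs whose ingestion failed within `window` of `now`.

    `records` is a hypothetical list of dicts with the keys
    job_id, status and attempted_at (a timezone-aware datetime).
    """
    cutoff = now - window
    return [
        record["job_id"]
        for record in records
        if record["status"] == "failed" and record["attempted_at"] >= cutoff
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = [
        {"job_id": 101, "status": "failed", "attempted_at": now - timedelta(hours=10)},
        {"job_id": 102, "status": "ok", "attempted_at": now - timedelta(hours=9)},
    ]
    # Only the failed job inside the one-week window is selected.
    print(jobs_to_reprocess(example, now))
```

The selected IDs would then be fed back into the normal ingestion routine, so the command itself stays small.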
Priority: P1 → P2
Summary: Data ingestion for multiple repos is many hours behind → Set up alerts for when data ingestion is failing and/or backlogged
Now that we have access to newrelic, this isn't a regression over TBPL (we're already better off than having to read the import logs on tbpl-dev/cache/...).
Guess we need to decide whether newrelic alerts are sufficient, or if we should get coverage via nagios too?
We should instrument the rabbitmq instance on the admin node to report to new relic. In that way we would be able to see how many tasks we have in each queue, etc. Maybe :fubar can help us?
Ah great idea :-)
As it happens, we are/were missing monitoring pieces for treeherder in nagios. :-( I've added standard http/s checks for zeus and the webheads. Currently looking at what nagios can do for rabbitmq, etc. newrelic is already set up on the admin node.
Making this bug more generic, since it sounds like we need to increase coverage for things other than data ingestion too.
(In reply to Mauro Doglio [:mdoglio] from comment #5)
> We should instrument the rabbitmq instance on the admin node to report to
> new relic.
> In that way we would be able to see how many tasks we have in each queue,
> etc.
Using something like this? http://newrelic.com/plugins/pivotal/95
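Independently of which plugin gets used, the "how many tasks in each queue" number is available from the RabbitMQ management plugin's HTTP API (`GET /api/queues`), assuming that plugin is enabled on the admin node. A rough sketch of reading queue depths and flagging backlogged queues follows; the base URL, credentials, threshold and queue names are placeholders, not treeherder's real configuration.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password, timeout=10):
    """Fetch the queue list from the RabbitMQ management API (GET /api/queues)."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def queue_depths(queues):
    """Map queue name to its total message count (0 if not reported)."""
    return {queue["name"]: queue.get("messages", 0) for queue in queues}


def backlogged(queues, threshold):
    """Return the sorted names of queues holding more than `threshold` messages."""
    depths = queue_depths(queues)
    return sorted(name for name, depth in depths.items() if depth > threshold)


if __name__ == "__main__":
    # Placeholder host and credentials; adjust for the real admin node.
    queues = fetch_queues("http://localhost:15672", "guest", "guest")
    for name in backlogged(queues, threshold=1000):
        print("backlogged queue:", name)
```

A check like this could be wrapped as a nagios plugin or pushed to New Relic as a custom metric; which of the two is preferable is the open question in this bug.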
Priority: P3 → P2
Summary: Set up alerts for when data ingestion is failing and/or backlogged → Improve Nagios & New Relic coverage of treeherder
Component: Treeherder → Treeherder: Infrastructure
Summary: [Meta] Improve Nagios & New Relic coverage of treeherder → [Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)
I've just updated the Heroku metrics alerts (see https://devcenter.heroku.com/articles/metrics#threshold-alerting) to go to treeherder-internal@ rather than only me.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID