Closed
Bug 1059325
Opened 9 years ago
Closed 4 years ago
[Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)
Categories
(Tree Management :: Treeherder: Infrastructure, task, P3)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: emorley, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: meta)
This bug tracks the long-term fix for this problem: even if this particular instance clears up, falling this far behind is a tree-closing event, and so it blocks switching to treeherder.
Reporter | ||
Comment 1•9 years ago
To clarify: looking at, say, mozilla-inbound on treeherder right now, I see jobs still marked as running as far back as https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=2ccb65865db7, even though they appear completed on TBPL and buildapi says they finished ~10 hours ago.
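A minimal sketch of how this symptom could be detected automatically: flag jobs that are still marked as running long after they started. The record fields (`id`, `state`, `start_time`), the 3 hour threshold and the sample timestamp are all hypothetical and not taken from treeherder's actual schema.

```python
from datetime import datetime, timedelta, timezone


def find_stale_running_jobs(jobs, now, max_age=timedelta(hours=3)):
    """Return jobs still marked 'running' that started more than max_age ago."""
    return [
        job
        for job in jobs
        if job["state"] == "running" and now - job["start_time"] > max_age
    ]


# Arbitrary reference time and made-up job records, for illustration only.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "state": "running", "start_time": now - timedelta(hours=10)},
    {"id": 2, "state": "running", "start_time": now - timedelta(minutes=20)},
    {"id": 3, "state": "completed", "start_time": now - timedelta(hours=11)},
]

stale = find_stale_running_jobs(jobs, now)
print([job["id"] for job in stale])
```

A non-empty result for a busy repo would be a reasonable signal that ingestion is backlogged rather than that the jobs are genuinely still running.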
Comment 2•9 years ago
This backlog in data ingestion is due to a problem we have had on production since the last code push yesterday. I saw an error in the routine that processes the incoming jobs, but it was not present on dev/stage. Everything was fixed by pushing the chief red button, without any code change. In my opinion, one way to mitigate this kind of problem would be to have a staging environment that matches the production architecture. We could also write a command to re-process the jobs that failed during ingestion in the last week or so.
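The selection step of such a re-processing command might look roughly like the sketch below. It assumes a record of ingestion attempts exists; the field names (`job_guid`, `status`, `attempted_at`) and the sample data are hypothetical, and how the command would be exposed or how the jobs would be re-queued is left out.

```python
from datetime import datetime, timedelta, timezone


def select_jobs_to_reprocess(ingestion_log, now, window=timedelta(days=7)):
    """Return the GUIDs of jobs whose ingestion failed within the window."""
    cutoff = now - window
    return [
        entry["job_guid"]
        for entry in ingestion_log
        if entry["status"] == "failed" and entry["attempted_at"] >= cutoff
    ]


# Arbitrary reference time and made-up log entries, for illustration only.
now = datetime(2020, 1, 8, tzinfo=timezone.utc)
ingestion_log = [
    {"job_guid": "a1", "status": "failed", "attempted_at": now - timedelta(days=1)},
    {"job_guid": "b2", "status": "ok", "attempted_at": now - timedelta(days=2)},
    {"job_guid": "c3", "status": "failed", "attempted_at": now - timedelta(days=9)},
]

to_retry = select_jobs_to_reprocess(ingestion_log, now)
print(to_retry)
```

Keeping the window as a parameter makes "the last week or so" easy to adjust per incident.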
Reporter | ||
Updated•9 years ago
No longer blocks: treeherder-sheriff-transition
Reporter | ||
Updated•9 years ago
Blocks: treeherder-dev-transition
Priority: P1 → P2
Summary: Data ingestion for multiple repos is many hours behind → Set up alerts for when data ingestion is failing and/or backlogged
Reporter | ||
Updated•9 years ago
Priority: P2 → P3
Reporter | ||
Comment 3•9 years ago
Now that we have access to newrelic, this isn't a regression over TBPL (we're already better off than having to read the import logs on tbpl-dev/cache/...).
Reporter | ||
Comment 4•9 years ago
Guess we need to decide whether newrelic alerts are sufficient, or if we should get coverage via nagios too?
Comment 5•9 years ago
We should instrument the rabbitmq instance on the admin node to report to new relic. In that way we would be able to see how many tasks we have in each queue, etc. Maybe :fubar can help us?
Flags: needinfo?(klibby)
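As a rough illustration of the per-queue visibility described above: if the RabbitMQ management plugin is enabled, its HTTP API (`GET /api/queues`) returns a JSON array of queue objects that include `name` and `messages`. The sketch below only processes such a payload; the queue names, counts and threshold are made up, and fetching the data and forwarding it to New Relic are not shown.

```python
import json


def backlogged_queues(queues, threshold):
    """Map queue name to message count for queues above the threshold."""
    return {
        queue["name"]: queue.get("messages", 0)
        for queue in queues
        if queue.get("messages", 0) > threshold
    }


# Hypothetical payload in the shape returned by GET /api/queues.
sample_payload = json.loads(
    """
    [
        {"name": "default", "messages": 12},
        {"name": "log_parser", "messages": 48213},
        {"name": "idle_queue"}
    ]
    """
)

over_limit = backlogged_queues(sample_payload, threshold=1000)
print(over_limit)
```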
Reporter | ||
Comment 6•9 years ago
Ah great idea :-)
Comment 7•9 years ago
As it happens, we are/were missing monitoring pieces for treeherder in nagios. :-( I've added standard http/s checks for zeus and the webheads. Currently looking at what nagios can do for rabbitmq, etc. New Relic is already set up on the admin node.
Flags: needinfo?(klibby)
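For the rabbitmq side, a Nagios-style check could be as small as the sketch below. It relies only on the standard Nagios plugin convention of return codes (0 = OK, 1 = WARNING, 2 = CRITICAL); the function name, thresholds and message format are hypothetical, and this is not an existing plugin.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_queue_depth(depth, warn, crit):
    """Return a (return_code, status_line) pair for a queue depth."""
    if depth >= crit:
        return CRITICAL, f"CRITICAL - {depth} messages queued"
    if depth >= warn:
        return WARNING, f"WARNING - {depth} messages queued"
    return OK, f"OK - {depth} messages queued"


# Hypothetical thresholds; a real plugin would print the status line
# and exit with the return code.
code, status_line = check_queue_depth(depth=5000, warn=1000, crit=10000)
print(code, status_line)
```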
Reporter | ||
Comment 8•9 years ago
Making this bug more generic, since it sounds like we need to increase coverage for things other than data ingestion too.

(In reply to Mauro Doglio [:mdoglio] from comment #5)
> We should instrument the rabbitmq instance on the admin node to report to
> new relic.
> In that way we would be able to see how many tasks we have in each queue,
> etc.

Using something like this? http://newrelic.com/plugins/pivotal/95
Priority: P3 → P2
Summary: Set up alerts for when data ingestion is failing and/or backlogged → Improve Nagios & New Relic coverage of treeherder
Reporter | ||
Updated•9 years ago
Component: Treeherder → Treeherder: Infrastructure
Reporter | ||
Updated•9 years ago
Priority: P2 → P3
Updated•7 years ago
QA Contact: laura
Reporter | ||
Updated•7 years ago
Summary: [Meta] Improve Nagios & New Relic coverage of treeherder → [Meta] Improve monitoring/alert coverage of treeherder (eg New Relic, Nagios, CloudWatch)
Reporter | ||
Updated•6 years ago
Assignee: nobody → emorley
Reporter | ||
Updated•6 years ago
Assignee: emorley → nobody
QA Contact: laura
Reporter | ||
Comment 9•6 years ago
I've just updated the Heroku metrics alerts (see https://devcenter.heroku.com/articles/metrics#threshold-alerting) to go to treeherder-internal@ rather than only me.
Updated•4 years ago
Type: defect → task
Comment 10•4 years ago
The two bugs are in the component's queue, which is not massive.
If we ever plan to tackle them, they are already filed. There is no need for a meta bug for only two bugs.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID