Closed Bug 1181742 Opened 7 years ago Closed 7 years ago

Investigate monitoring solution for decision task

Categories: Taskcluster :: General, defect
Priority: Not set
Severity: normal

Tracking: Not tracked

RESOLVED FIXED

People

(Reporter: garndt, Unassigned)

References

Details

Currently we have several things in place for alerting: threshold-based alerts via Papertrail, InfluxDB metrics, Grafana graphs, a status page for Taskcluster components, and a few other tools that give insight into the Taskcluster infrastructure.

However, all of this relies either on events reaching the Papertrail logs or on someone watching the various dashboards. There should be a service that monitors the health of Taskcluster and alerts the necessary people when thresholds are crossed or issues arise.

Some use cases:
1. the provisioner stops provisioning nodes for some reason (no noticeable alerts reach the logs)
2. mozilla-taskcluster stops creating decision tasks

Some ideas:
1. a custom service with a rule-based system for monitoring these things
2. a Nagios container deployed via Tutum to set up monitoring of various things. I believe there are monitoring agents for InfluxDB, as well as the ability to create custom checks.
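To make idea 1 concrete, a rule-based monitor could pair each metric with a threshold and an alert callback. Everything below (names, thresholds, the callback shape) is a hypothetical sketch, not part of any existing Taskcluster component:

```python
# Hypothetical sketch of a rule-based monitor: each rule pairs a metric
# fetcher with a threshold; breaching the threshold fires an alert callback.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    fetch: Callable[[], float]        # returns the current metric value
    threshold: float                  # alert when the value exceeds this
    alert: Callable[[str, float], None]


def evaluate(rules: List[Rule]) -> List[str]:
    """Run every rule once; return the names of the rules that fired."""
    fired = []
    for rule in rules:
        value = rule.fetch()
        if value > rule.threshold:
            rule.alert(rule.name, value)
            fired.append(rule.name)
    return fired


# Example: a fake pending-task metric stuck at 250 against a limit of 100.
alerts = []
rules = [Rule("pending-tasks", lambda: 250.0, 100.0,
              lambda name, v: alerts.append(f"{name}={v}"))]
print(evaluate(rules))  # -> ['pending-tasks']
```

In a real deployment the `fetch` callables would query InfluxDB or Papertrail, and `alert` would page or email; the loop would run on a timer.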
Component: TaskCluster → General
Product: Testing → Taskcluster
We had several tree-closing issues where Nagios-style monitoring could have helped catch the problems earlier.

So far I was told:

"monitoring? we don't have a real monitoring solution for any tc components right now :("

so we need monitoring.
Blocks: 1080265
:Tomcat, do you have specifics of things that were down that are not part of the use cases above? (provisioner stops provisioning and mozilla-taskcluster not creating decision tasks)

In addition to the above, I would like to monitor pending times and alert when there is a growing backlog.

I would like to get the entire picture of what we would like to monitor to come up with the best solution.
Hey garndt, one was during the tree closure last Friday:

wcosta: hrm, so, we have two problems here: the artifact is not there, and tc-vcs doesn't handle it appropriately. One approach would be for tc-vcs to fall back to repo cloning when the cache fails. This would avoid a tree closure, but it would take a lot more time for people to detect that something is wrong
12:25 jhford: or if we had an alerting system, we could have an alert every time it fell back, and if it's doing more than X fallbacks per hour we send an email
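jhford's rate-limit idea above amounts to a sliding one-hour window over fallback timestamps. A minimal sketch, assuming an in-process counter (the class name and threshold are illustrative, not part of any Taskcluster tool):

```python
# Sketch of "alert on more than X fallbacks per hour": record each
# fallback's timestamp and check the count inside a one-hour window.
from collections import deque


class FallbackAlerter:
    def __init__(self, max_per_hour: int):
        self.max_per_hour = max_per_hour
        self.events = deque()  # timestamps (seconds) of recent fallbacks

    def record(self, now: float) -> bool:
        """Register a fallback at time `now`; return True if over the limit."""
        self.events.append(now)
        # Drop events that fell out of the one-hour window.
        while self.events and now - self.events[0] > 3600:
            self.events.popleft()
        return len(self.events) > self.max_per_hour


alerter = FallbackAlerter(max_per_hour=3)
print([alerter.record(t) for t in (0, 600, 1200, 1800)])
# -> [False, False, False, True]  (the 4th fallback within the hour fires)
```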
We've implemented a provisioning monitor and restart mechanism. Since doing that a month or so ago, we've not had monitoring concerns about tasks not launching.  Is the original purpose of this bug resolved at this point? I realize we have other monitoring issues, but those are being tracked elsewhere.
Flags: needinfo?(garndt)
Summary: Investigate monitoring solution for taskcluster services → Investigate monitoring solution for decision task
We have a few things in place:

1. Provisioner check-in using a third-party service (Dead Man's Snitch)
2. taskcluster-monitor [1], which both monitors provisioner iterations and checks InfluxDB for pending task counts for the gecko decision worker type

I believe this can be closed, as we now have adequate monitoring in place to detect outages ahead of time.

[1] https://github.com/taskcluster/taskcluster-monitor/blob/master/config/alerts/influx.yml#L2
Flags: needinfo?(garndt)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED