Closed Bug 1421983 Opened 8 years ago Closed 5 years ago

Show failure and total job count for job type and all jobs to help identify broken machines / worker instances

Categories

(Tree Management :: Treeherder, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: aryx, Unassigned)

Details

• Worker instances can fail permanently because of infrastructure issues, e.g.
  ◦ because their disk is full or there are write access issues – this causes dozens or hundreds of failures if the instance isn’t quarantined or terminated quickly
  ◦ for all jobs when a common feature gets requested
  ◦ only for some job types, often related to graphics, e.g. webgl (“gl”) and reftests (“R”)
• It’s hard to attribute test failures to such broken machines. Often one has to remember that the test might check something which reveals the problem (e.g. it checks the whole screen for abnormal pixel values at a position) and then verify that by clicking through to the view which lists the history of past jobs in the provisioner (see below).
• Treeherder should show e.g. a line “Failures: Test type: <failure count for jobs run for this test type>/<number of jobs run for this test type> All: <total failure count>/<total job count>” and link it to a list like the one below, where one can also terminate such a machine.
• Automating this is tricky because one has to filter out general infrastructure issues (e.g. download failures) as well as bustage and test failures caused by code changes.
• Example:
  1. Open https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=a470c07c1474894bff2da6b06123ca4144d125e8&group_state=expanded&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable&selectedJob=148545155
  2. In the bottom toolbar, click the “...” button.
  3. Select “Inspect task”.
  4. Click on “Task Run 0 (Latest)”.
  5. Click on the linked WorkerId.
  6. Notice that the webgl tests have been failing on this machine.
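The requested summary line could be derived from a worker's job history roughly as follows. This is only a sketch: the `Job` record, the `FAILURE_RESULTS` set and the function name `failure_summary` are hypothetical and do not reflect Treeherder's or Taskcluster's actual data model. The result names are borrowed from the `filter-resultStatus` values in the example URL.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Job(NamedTuple):
    """Hypothetical job record; not an actual Treeherder schema."""
    job_type: str  # e.g. "gl" or "R"
    result: str    # e.g. "success", "testfailed", "busted", "exception"


# Assumption: these results count as failures for the summary line.
FAILURE_RESULTS = {"testfailed", "busted", "exception"}


def failure_summary(jobs: Iterable[Job], job_type: str) -> str:
    """Format failure/total counts for one job type and for all jobs."""
    total = Counter()
    failed = Counter()
    for job in jobs:
        total[job.job_type] += 1
        if job.result in FAILURE_RESULTS:
            failed[job.job_type] += 1
    return (
        f"Failures: Test type: {failed[job_type]}/{total[job_type]} "
        f"All: {sum(failed.values())}/{sum(total.values())}"
    )


# Made-up history for a single worker, for illustration only.
history = [
    Job("gl", "testfailed"),
    Job("gl", "testfailed"),
    Job("R", "success"),
    Job("gl", "testfailed"),
    Job("bc", "success"),
]
print(failure_summary(history, "gl"))
```

The sketch does not attempt the hard part named above: separating machine-specific failures from general infrastructure issues and from failures caused by code changes.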
I think this might be something best handled by the system that scheduled the jobs? Greg, I don't suppose you know (or could redirect this to someone who might) whether this is possible for Taskcluster already, or on the roadmap?
Flags: needinfo?(garndt)
Currently this is not possible due to the way our datastore works. In the future it might be possible because we are investigating moving to a relational DB that would allow queries like this to be performed, but I still can't guarantee how easy it would be to get the information. Something to definitely consider while designing things though. We do keep an ancillary table with the last 20 jobs per machine as a temporary measure, but that would not be helpful in this case due to the shortened job history we maintain. If 20 jobs is enough for now, there is an API for these jobs, I believe, and TH could calculate the failure rate for that given machine to display.
Flags: needinfo?(garndt)
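If the 20-job history is used, the failure rate for a machine might be computed along these lines. This is a sketch under assumptions: the input is taken to be a list of per-task state strings already fetched from the API mentioned above (the fetch itself is not shown), the state names in `FAILED_STATES` are assumed, and `min_jobs` is a hypothetical guard against judging a machine on too few jobs.

```python
from typing import Optional, Sequence

WINDOW = 20  # per-machine history size mentioned in the comment above

# Assumption: task states that count as a failure on the machine.
FAILED_STATES = {"failed", "exception"}


def worker_failure_rate(
    recent_states: Sequence[str], min_jobs: int = 5
) -> Optional[float]:
    """Return the failure rate over the most recent WINDOW tasks.

    Returns None when fewer than min_jobs tasks are available, since a
    rate over a handful of jobs says little about the machine.
    """
    window = recent_states[-WINDOW:]
    if len(window) < min_jobs:
        return None
    failures = sum(1 for state in window if state in FAILED_STATES)
    return failures / len(window)


# Made-up history: 14 completed tasks followed by 6 failed ones.
rate = worker_failure_rate(["completed"] * 14 + ["failed"] * 6)
print(rate)
```

With such a short window the rate cannot be broken down reliably per job type, which is the limitation the comment points out.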

It looks like dashboards have been made (see bug 1617552) to meet this need.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Treeherder: Log Parsing & Classification → TreeHerder