Add a 'machine failures' page to treeherder, based on Ouija

RESOLVED WONTFIX

Status

P5
normal
RESOLVED WONTFIX
4 years ago
2 years ago

People

(Reporter: dminor, Unassigned)

(Reporter)

Description

4 years ago
Ouija (see bug 909735, running instance at http://54.215.155.53/) has a slave failures page that shows each slave's failure rate and lets it be compared to the "expected" failure rate for a slave that has run that number of jobs. It also displays the number of jobs each slave has run since its last success.

This allows problematic slaves to be identified and removed from the pool. This tool has been useful to the sheriffs in the past and would see more use if it were part of treeherder.
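
As a rough illustration of the kind of report described above, the following Python sketch computes each machine's failure rate and its number of jobs since last success, and flags machines whose failure count is well above what would be expected for that number of jobs. The job-record format, the function name, the use of the pool-wide failure rate as the "expected" rate, and the binomial threshold `z` are all assumptions made for illustration; this is not Ouija's actual implementation.

```python
import math
from collections import defaultdict


def machine_failure_report(jobs, z=3.0):
    """Summarize per-machine failures from (machine, succeeded) records.

    `jobs` is an iterable of (machine_name, succeeded) tuples in
    chronological order. This is a hypothetical format, not Ouija's schema.
    """
    jobs = list(jobs)
    if not jobs:
        return {}

    total = defaultdict(int)
    failed = defaultdict(int)
    since_success = defaultdict(int)

    for machine, succeeded in jobs:
        total[machine] += 1
        if succeeded:
            since_success[machine] = 0
        else:
            failed[machine] += 1
            since_success[machine] += 1

    # Assumption: the pool-wide failure rate stands in for the "expected" rate.
    pool_rate = sum(failed.values()) / len(jobs)

    report = {}
    for machine, n in total.items():
        expected = n * pool_rate
        # Binomial standard deviation for n jobs at the pool-wide rate.
        stddev = math.sqrt(n * pool_rate * (1.0 - pool_rate))
        report[machine] = {
            "jobs": n,
            "failures": failed[machine],
            "failure_rate": failed[machine] / n,
            "expected_failures": expected,
            "jobs_since_last_success": since_success[machine],
            # Flag machines more than z standard deviations above expectation.
            "suspect": failed[machine] > expected + z * stddev,
        }
    return report
```

Scaling the expectation by the number of jobs each machine has run keeps a machine with only a handful of jobs from being flagged on one or two failures.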
Blocks: 1054977
OS: Linux → All
Priority: -- → P4
Hardware: x86_64 → All
Tweaking summary to make this bug more clearly different from bug 1087532 (I misread it a few times).

Since bug 1087532 is more practical short term, I'll make that block the TBPL EOL bug, rather than this one.
No longer blocks: 1054977
Summary: Port Ouija 'slave failures' page to treeherder → Add a 'slave failures' page to treeherder, based on Ouija
Summary: Add a 'slave failures' page to treeherder, based on Ouija → Add a 'machine failures' page to treeherder, based on Ouija
(Reporter)

Updated

4 years ago
Blocks: 1110775
(Reporter)

Comment 2

4 years ago
Ed, I think the backend piece of this is a good candidate for the big/open data project we've been talking about (I'll file a separate bug.)

I was wondering if you thought the reporting bit still belonged in treeherder?
Flags: needinfo?(emorley)
I think it probably still does. Unlike some of the other reports/dashboards (e.g. orangefactor), the "bad machines" report needs to be actioned in near real time, at least for the "runaway machine that's chewing through 100 jobs in N hours" case. Longer-term analysis (does this machine have a higher rate of failure over the last 3 weeks, so that it might have some bad RAM?) could perhaps be kept elsewhere, though.
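
To make the "runaway machine" case concrete, here is a minimal Python sketch of a sliding-window check that reports machines with at least `limit` failed jobs inside any window of a given length. The record format, the function name, and the default values (100 failures, 2 hours) are hypothetical and chosen only for illustration; nothing here reflects existing treeherder code.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_runaway_machines(failures, window=timedelta(hours=2), limit=100):
    """Return the set of machines with >= `limit` failures within `window`.

    `failures` is an iterable of (machine_name, finished_at) pairs for
    failed jobs, sorted by `finished_at`. Hypothetical input format.
    """
    recent = defaultdict(deque)  # machine -> failure times inside the window
    runaway = set()
    for machine, finished_at in failures:
        times = recent[machine]
        times.append(finished_at)
        # Drop failures that have fallen out of the window.
        while finished_at - times[0] > window:
            times.popleft()
        if len(times) >= limit:
            runaway.add(machine)
    return runaway
```

Because each failure is appended and removed at most once, the check is cheap enough to run on every ingestion cycle, which fits the near-real-time requirement.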
Flags: needinfo?(emorley)
Priority: P4 → P5
Duplicate of this bug: 1156387
In a Taskcluster world of AWS spot instances, I don't think the machine failures page really makes much sense.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX