Ouija (see bug 909735, running instance at http://126.96.36.199/) has a slave failures page that allows each slave's failure rate to be determined and compared to the "expected" failure rate for a slave that has run that number of jobs. It also displays the number of jobs since last success. This allows problematic slaves to be identified and removed from the pool. This tool has been useful to the sheriffs in the past and would see more use if it were part of treeherder.
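For illustration, the comparison described above could be sketched roughly as follows. This is not Ouija's actual implementation: the use of the pooled failure rate as the "expected" rate, the binomial tail test, the `alpha` threshold, and all function names are assumptions made for this sketch.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def jobs_since_last_success(results):
    """Count trailing failures; results is oldest-first, True means success."""
    count = 0
    for ok in reversed(results):
        if ok:
            break
        count += 1
    return count


def suspicious_machines(history, alpha=0.01):
    """Flag machines whose failure count is unlikely given the pool-wide rate.

    history maps a (hypothetical) machine name to its oldest-first job results.
    Assumption: the "expected" rate is the failure rate over all machines.
    """
    total_jobs = sum(len(r) for r in history.values())
    if total_jobs == 0:
        return []
    pool_rate = sum(r.count(False) for r in history.values()) / total_jobs

    flagged = []
    for name, results in history.items():
        failures = results.count(False)
        if failures == 0:
            continue
        # Probability of seeing at least this many failures by chance.
        if binomial_tail(len(results), failures, pool_rate) < alpha:
            flagged.append(
                (name, failures / len(results), jobs_since_last_success(results))
            )
    return flagged


# Made-up sample data: two ordinary machines and one that fails repeatedly.
history = {
    "a": [False] * 5 + [True] * 95,
    "b": [False] * 5 + [True] * 95,
    "c": [True] * 10 + [False] * 10,
}
flagged = suspicious_machines(history)
```

Scaling the test by the number of jobs run matters here: a machine with 1 failure in 2 jobs has the same raw rate as one with 10 failures in 20, but only the latter is strong evidence of a problem.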
OS: Linux → All
Priority: -- → P4
Hardware: x86_64 → All
Tweaking summary to make this bug more clearly different from bug 1087532 (I misread it a few times). Since bug 1087532 is more practical short term, I'll make that block the TBPL EOL bug, rather than this one.
No longer blocks: 1054977
Summary: Port Ouija 'slave failures' page to treeherder → Add a 'slave failures' page to treeherder, based on Ouija
Summary: Add a 'slave failures' page to treeherder, based on Ouija → Add a 'machine failures' page to treeherder, based on Ouija
Ed, I think the backend piece of this is a good candidate for the big/open data project we've been talking about (I'll file a separate bug). Do you think the reporting bit still belongs in treeherder?
I think it probably still does. Unlike some of the other reports/dashboards (e.g. orangefactor), the "bad machines" report needs to be actioned in near real time, at least for the "runaway machine that's chewing through 100 jobs in N hours" case. Longer-term analysis (does this machine have a higher failure rate over the last 3 weeks, suggesting it might have some bad RAM?) could perhaps be kept elsewhere, though.
In a Taskcluster world of AWS spot instances, I don't think the machine failures page really makes much sense.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX