Filing this as a meta bug for the work discussed at various Whistler meetings.
Currently when someone classifies a failure on Treeherder, a few things happen:
* The classification is stored in Treeherder's DB (in the bug_job_map and job_note tables, one of each per repository).
* The classification is also submitted to ElasticSearch in a much less useable form, since OrangeFactor only consumes from ElasticSearch for legacy reasons.
* A bug comment is added to the bug (if a number was provided) - one for each failure.
Problems with the current approach:
* We're duplicating data, which is daft (this was only intended as a short term interim step).
* We're spamming Bugzilla, causing DB bloat and perf issues (both in bug rendering and also DB text search times) - and the BMO team *really* wants us to turn these comments off.
* The data submitted to ElasticSearch is in a much less useable form than that in the Treeherder DB (eg not all of the job properties are available, the bug numbers have to be scraped from strings etc) so it limits the improvements that can be made to OrangeFactor.
* OrangeFactor doesn't have notifications & makes it hard to answer certain questions, so people rely on unread bugmail counts & local Python scripts that scrape bug comments to fill the gaps.
In addition, we now want:
* Intermittent failure bugs to be more actionable - namely by only filing bugs for the top N issues and not all failures (ie similar to the crashstats topcrash approach).
* Classifications to be made automatically by Treeherder in the vast majority of cases, so humans are not acting as expensive pattern matchers hundreds of times a day (ie bug 1177519).
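
To make the "top N" idea concrete, here is a minimal sketch of picking the most frequent failures. It assumes each classified failure has already been reduced to a signature string; the signatures and the `top_failures` helper are hypothetical and not part of Treeherder.

```python
from collections import Counter


def top_failures(signatures, n):
    """Return the n most frequent failure signatures with their counts."""
    return Counter(signatures).most_common(n)


# Hypothetical signatures, one entry per classified failure.
observed = [
    "test_a.js | timeout",
    "test_b.js | assertion failed",
    "test_a.js | timeout",
    "test_c.js | crash",
    "test_a.js | timeout",
    "test_b.js | assertion failed",
]

# Only these would be candidates for having a bug filed.
for signature, count in top_failures(observed, 2):
    print(count, signature)
```

In practice the cut-off would probably also need a time window (eg failures in the last week), which this sketch leaves out.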
Implications of this:
a) If we only file bugs for the top issues, we can no longer use the bug number as the ID for each failure, and Treeherder needs to track more state internally, since Bugzilla will not know about all previous failures.
b) To automatically classify failures, we need a lower false-positive rate than the current simplistic bug summary string matching algorithm gives us.
c) It's not worth making any significant changes to OrangeFactor (eg switching to using the Treeherder DB) until we've figured out the model/workflow changes due to (a)/(b), since we're going to need some significant schema changes.
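
As a rough illustration of (a), the sketch below keys each failure on a signature rather than a bug number, with the bug number as an optional field. Everything here is an assumption for discussion: the `failure_signature` normalisation, the `FailureRecord` fields and the in-memory `store` are hypothetical, not the eventual Treeherder schema or signature algorithm.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Dict, Optional


def failure_signature(test_name: str, message: str) -> str:
    """Hash a normalised failure line into a stable identifier (hypothetical scheme)."""
    # Replace digit runs so timings, PIDs and line numbers don't split one issue in two.
    normalised = re.sub(r"\d+", "N", message.strip().lower())
    digest = hashlib.sha1(f"{test_name}|{normalised}".encode("utf-8"))
    return digest.hexdigest()[:12]


@dataclass
class FailureRecord:
    signature: str
    occurrences: int = 0
    bug_number: Optional[int] = None  # only set once a bug is actually filed


def record_failure(store: Dict[str, FailureRecord], test_name: str, message: str) -> FailureRecord:
    """Count one occurrence of a failure, creating its record on first sight."""
    sig = failure_signature(test_name, message)
    rec = store.setdefault(sig, FailureRecord(signature=sig))
    rec.occurrences += 1
    return rec


store: Dict[str, FailureRecord] = {}
record_failure(store, "test_foo.js", "Timed out after 330 seconds")
record_failure(store, "test_foo.js", "Timed out after 345 seconds")
print(len(store))  # both lines map to the same record
```

The point is only that the primary key is something Treeherder generates itself, so failures without a bug can still be tracked and counted.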
I think the plan will be something roughly like:
1) Determine which workflows are using the bug comments, so we can work around them short term and not block turning bug comments off. eg we could use a script to query OrangeFactor and post a daily/weekly update on intermittent failure bugs to replace the "looking at unread bugmails" workflow.
2) Turn off bug comments as soon as possible.
3) Determine the best way to generate a signature for a failure (partly discussed at Whistler).
4) Make any changes to the test harnesses to support this (eg bug 1177630).
5) Plan the Treeherder schema for tracking this (likely one table for all repos, rather than the current multi-db approach; thus this might depend on the rather large bug 1178641 - though I guess we could switch to a single table just for intermittent failures first?).
6) Make the required changes to Treeherder (both backend and UX for the frontend).
7) Figure out a migration strategy for existing bug classification data.
8) Implement a way to automatically classify failures in Treeherder, initially in parallel with human classification.
9) Run stats on the automatic classifications to see how reliable they are.
10) Implement a Treeherder UI for reviewing automatic classifications on a periodic basis.
11) Switch over to using automatic classifications in the vast majority of cases.
-> profit \o/
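
For steps 8 and 9, one possible way to measure reliability while automatic and human classification run in parallel is sketched below. It assumes we can pair each automatic classification with the human one for the same job; the `classification_stats` helper, the metric names and the sample IDs are all hypothetical.

```python
def classification_stats(pairs):
    """Summarise (automatic, human) classification pairs.

    `automatic` is None when the classifier declined to guess.
    """
    total = attempted = agreed = 0
    for automatic, human in pairs:
        total += 1
        if automatic is None:
            continue
        attempted += 1
        if automatic == human:
            agreed += 1
    return {
        # Fraction of failures the classifier attempted at all.
        "coverage": attempted / total if total else 0.0,
        # Fraction of attempts that matched the human classification.
        "agreement": agreed / attempted if attempted else 0.0,
    }


# Hypothetical sample: (automatic, human) classification IDs.
sample = [("sig-1", "sig-1"), ("sig-2", "sig-3"), (None, "sig-4"), ("sig-1", "sig-1")]
stats = classification_stats(sample)
print(stats)
```

Here one minus "agreement" is a crude stand-in for the false-positive rate, on the assumption that the human classification is correct.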
The numbered steps in comment 0 also need to include:
* Decide how to continue to support OrangeFactor as we make the schema changes (eg just continue to sync only the failures that have bug numbers to ElasticSearch, short term?)
* Create appropriate APIs for an OrangeFactor v2
* Create an OrangeFactor v2 to use these APIs - that also doesn't assume that every failure has a bug number (I think there is still use in being able to view all failures in OrangeFactor; we can just make sure the default views de-emphasise failures with no bug/few occurrences).
* Hopefully pick a clearer name than "OrangeFactor v2" ("Orange" is quite an in-house term; IntermittentHerder? haha)