Analyze build data to identify rogue test machines

Status

RESOLVED DUPLICATE of bug 909735
6 years ago
4 years ago

People

(Reporter: dminor, Unassigned)

(Reporter)

Description

6 years ago
Looking at smoketest data, it appears that a disproportionate number of failures occur on individual build machines. It would be nice to have a tool that allows for automated detection of "rogue" machines.

A first step towards this would be to gather data and present it in a way that makes it easier for humans to identify rogue machines. Once this is complete, it may be possible to automate the detection. This would allow rogue machines to be repaired and could be used to reduce noise in data about intermittent failures.

Discussion on etherpad: https://etherpad.mozilla.org/84b6xenAaY
(Reporter)

Comment 1

6 years ago
It looks like Orange Factor already has the data available to do this, through its failures-by-machine view.

We would also need the total runs per machine in the time period to usefully identify a rogue machine. It would also be nice to classify failures by type, e.g. timeouts, crashes, intermittents. We can do this to a limited extent by bug type.
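
To make the runs-per-machine point concrete, here is a rough Python sketch of how failures could be normalized by total runs and grouped by type. Everything in it is an assumption for illustration: the `(machine, failure_type)` input shape, the function names, and the `ratio` and `min_runs` thresholds are hypothetical and are not taken from Orange Factor.

```python
from collections import Counter


def failure_rates(runs):
    """Summarize runs per machine.

    `runs` is an iterable of (machine, failure_type) pairs, where
    failure_type is None for a passing run (hypothetical input shape).
    """
    totals = Counter()
    failures = Counter()
    by_type = {}
    for machine, failure_type in runs:
        totals[machine] += 1
        if failure_type is not None:
            failures[machine] += 1
            by_type.setdefault(machine, Counter())[failure_type] += 1
    return {
        m: {
            "runs": totals[m],
            "failures": failures[m],
            "rate": failures[m] / totals[m],
            "by_type": dict(by_type.get(m, {})),
        }
        for m in totals
    }


def flag_rogue(stats, ratio=2.0, min_runs=20):
    """Return machines whose failure rate exceeds `ratio` times the
    fleet-wide rate, ignoring machines with fewer than `min_runs` runs."""
    total_runs = sum(s["runs"] for s in stats.values())
    total_failures = sum(s["failures"] for s in stats.values())
    fleet_rate = total_failures / total_runs if total_runs else 0.0
    return sorted(
        m
        for m, s in stats.items()
        if s["runs"] >= min_runs and s["rate"] > ratio * fleet_rate
    )
```

The `min_runs` cutoff is there because a machine with only a handful of runs can show a very high failure rate by chance, which is exactly why raw failure counts alone are not enough.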

It might also be nice to graph the failures over the time period to get a better visual feeling for what is going on. I did something like this for smoketests, example here: http://people.mozilla.com/~dminor/stuff/events.png.
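
As a sketch of the data behind such a graph, failures could be bucketed per machine per day before plotting. The `(machine, unix_timestamp)` input shape and the function names below are assumptions made for illustration, and the text chart is only a stand-in for a real plot.

```python
from collections import defaultdict
from datetime import datetime, timezone


def failures_per_day(failures):
    """Count failures per machine per UTC day.

    `failures` is an iterable of (machine, unix_timestamp) pairs
    (hypothetical input shape). Returns {machine: {iso_date: count}}.
    """
    series = defaultdict(lambda: defaultdict(int))
    for machine, ts in failures:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        series[machine][day] += 1
    return {m: dict(sorted(days.items())) for m, days in series.items()}


def text_chart(days):
    """Render one machine's daily counts as a crude text bar chart."""
    return "\n".join(f"{day} {'#' * count}" for day, count in sorted(days.items()))
```

A machine that goes bad at a particular point in time should show up as a sudden run of long bars, which is harder to spot in a single aggregate count.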
(Reporter)

Updated

6 years ago
Summary: Analyze build data to identify rogue build machines → Analyze build data to identify rogue test machines

Comment 2

I think we can add some additional logic to the Orange Factor log parsers to do this for us.

I would like to know if the log parsers work on all the logs specified in:
http://hg.mozilla.org/automation/logparser/file/3bc18333205a/logparser/config.py

My assumption is that they do, and that we could add some additional passes/output to the parsers to do this.

Most of the data is there, and we could create our own view/page to help us figure this out. Likewise, we could work on creating a time sequence graph to help visualize what is happening!

Comment 3

5 years ago
Deprecating Testing :: Infrastructure, and sounds like this is a job for Orange Factor anyway.

(I'll try to take a look at this at some point before too long if it's still deemed important-ish.)
Component: Infrastructure → Orange Factor
(Reporter)

Comment 4

5 years ago
I think it makes sense to close this as a duplicate of bug 909735, which is aiming to accomplish the same thing but is more accessible to contributors.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 909735
(Assignee)

Updated

4 years ago
Product: Testing → Tree Management