Open Bug 1193512 Opened 4 years ago Updated 4 years ago

OrangeFactor change detection


(Testing :: General, defect)

Not set


(Not tracked)


(Reporter: ekyle, Assigned: dtjh)


from David,

> There is something that I'd like your opinion on:
> I am working on doing the change detection on the OF data.
> Often what happens is there is a significance threshold to set 
> that controls the various trade-offs in the performance (false 
> alarm rate, true detection rate, etc.)
> My perspective is that you guys are probably okay with a 
> moderate level of false alarms (1-5%) but want to capture 
> all the real changes (the false alarms and the true detection 
> rate have an inverse correlation). 
> I am also researching a probabilistic model which provides a 
> level of confidence for the likelihood of a bug undergoing 
> changes (this has not been done before in the research 
> community so I'm quite excited with this part).
> Any insight based on your experience with them would be great.
Assignee: nobody → dtjh
Just to clarify the trade-off between false alarm and true detection a little bit:
With any change detection method, the false alarm rate inevitably goes up when a better true detection rate is desired.

The idea with the probabilistic model is to give a confidence so that false positives (theoretically more likely to have lower confidence) can be better distinguished from the true detections (more likely to have higher confidence).
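
To make the trade-off concrete, here is a minimal sketch. It is not the algorithm being developed for OrangeFactor. It assumes daily failure counts are Poisson-distributed, uses a deliberately naive score (mean of the second half minus mean of the first half), and the names `shift_score`, `simulate` and `rates_by_threshold`, as well as the rates 2.0 and 3.5, are made up for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson sample (Knuth's method; fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate(rng, days, rate_before, rate_after):
    """Synthetic daily failure counts whose rate switches halfway through."""
    half = days // 2
    return ([poisson(rng, rate_before) for _ in range(half)]
            + [poisson(rng, rate_after) for _ in range(days - half)])


def shift_score(counts):
    """Naive change score: mean of the second half minus mean of the first."""
    half = len(counts) // 2
    before, after = counts[:half], counts[half:]
    return sum(after) / len(after) - sum(before) / len(before)


def rates_by_threshold(thresholds, trials=500, seed=0):
    """Return (threshold, false_alarm_rate, detection_rate) per threshold."""
    rng = random.Random(seed)
    # Series with no real change: any alert here is a false alarm.
    stable = [shift_score(simulate(rng, 28, 2.0, 2.0)) for _ in range(trials)]
    # Series with a real increase: any alert here is a true detection.
    changed = [shift_score(simulate(rng, 28, 2.0, 3.5)) for _ in range(trials)]
    rows = []
    for t in thresholds:
        false_alarm = sum(s > t for s in stable) / trials
        detection = sum(s > t for s in changed) / trials
        rows.append((t, false_alarm, detection))
    return rows


for row in rates_by_threshold([0.0, 0.5, 1.0, 1.5]):
    print("threshold=%.1f false_alarm=%.3f detection=%.3f" % row)
```

Raising the threshold lowers both columns together, and lowering it raises both, which is the trade-off described above.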
I believe test failures come in two recognizable categories: "real" test failures (regressions), which indicate code that breaks a test; and "noise", test failures that cannot be attributed to any code change.

The answer to your question depends on whether you are focusing on detecting real regressions, or noise.  If you are trying to detect the noise then I agree with your assumptions.  Detecting real test failures is another matter.

I strongly suspect the number of real failures an automated system can detect will easily overwhelm our limited resources.  Missing real changes, especially the "small"[1] ones, is tolerable given we may never be able to act on them.  Furthermore, false alarms can undermine confidence in the system; although a 1% false positive rate may be acceptable, 5% is definitely too large.

Given a low tolerance for error, you may lack sufficient data to make a timely determination.  This is where I stumbled in the past:  giving a high quality alert in short order does not seem possible.  Maybe accept low confidence with quick detection, or high confidence with slow detection, or maybe adjust the error tolerance based on the age of the regression.
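
A rough back-of-the-envelope sketch of why confidence and speed pull against each other. It assumes Poisson daily counts and a plain z-test on the difference of two rates, and it plugs in the expected rates rather than noisy observations. The function name `days_to_detect` and the example numbers are hypothetical, not taken from the OrangeFactor data.

```python
import math


def days_to_detect(rate_before, rate_after, baseline_days, z_threshold,
                   max_days=365):
    """Smallest number of post-change days at which the expected z-score
    of the rate difference reaches z_threshold, or None if it never does
    within max_days.

    For Poisson counts the variance of a mean rate over n days is rate / n.
    """
    for n in range(1, max_days + 1):
        std_err = math.sqrt(rate_before / baseline_days + rate_after / n)
        z = (rate_after - rate_before) / std_err
        if z >= z_threshold:
            return n
    return None


# Same regression (2 -> 3 failures/day, 28 days of history),
# two different tolerances for error.
quick = days_to_detect(2.0, 3.0, 28, z_threshold=1.645)  # looser threshold
slow = days_to_detect(2.0, 3.0, 28, z_threshold=3.0)     # stricter threshold
print(quick, slow)
```

Under these assumptions the stricter threshold needs several times as many days of data as the looser one, so an alert can be early or confident but not both.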

The others copied on this bug have a better feel for the features and anomalies of test failures and the OrangeFactor data.  Their feedback will be of greater utility than mine.   

[1] "small" can mean real-but-rare intermittent failure – detectable, but very difficult to reproduce and fix.
I agree, if "change" here is taken to mean a real test failure is identified, then keeping the false alarm rate low is more important than identifying every failure, since as Kyle says, there is a long tail of unimportant (but real) failures.
On a call yesterday, David summarized his work on detecting change in the rate of test failures.  He has a sample of intermittents which he is using to confirm his algorithm: bugs that show sheriffs found the failure important enough to take action*.  He showed a few to me.  His algorithm can highlight changes in the failure rate, sometimes days before the sheriffs take action.  It is hoped that this change detection can be used to generate a list of top regressions, which could be added to the existing OrangeFactor dashboard.

There are plenty more technical details, so please contact David if you have questions.

*action in this case is a comment on the bug, asking for further investigation into a serious failure.
That's really cool. This will also (eventually) make it possible for autoland to make better decisions on whether a change should be allowed to land or not.

Is there any way we can test it out at home?
The next thing on my list of to-dos is to have a prototype implemented in an offline version of OF. 
After I've done this it will be easier for you guys to test it, and easier for me to get your feedback so I can further fine-tune & modify the algorithms.

Unfortunately I will be away for about 2 weeks. I am travelling to Europe to present at a machine learning conference so just a heads up that the implementation will come after that.

Thanks ;)
I believe the next step is to set up a video meeting with David:  He needs input from us, especially the sheriffs, to further refine how the dashboard should prioritize bugs.
Out of interest, is this at a point where you can answer specific questions? For example I think the rate of Wr (Web-Platform-Tests-Reftests) oranges on Linux(64) Debug builds went up at some point in the last few weeks, but I'm not sure when. Are you able to see that change and suggest a revision range to investigate in more detail?
This is a great perspective, thanks.

At the moment it can answer when the rate of oranges went up within a given range of dates for a particular bug.
I have not thought about monitoring other "groupings" - this is one of the reasons I'd like to present & discuss the work, to get feedback and perspectives that I might have overlooked.

Having said that, it could be extended to answer that specific question.
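
For anyone who wants to experiment in the meantime, here is a minimal sketch of locating a single rate change within a date range. It is a generic least-squares two-segment fit, not the algorithm discussed in this bug. The function name `best_split`, the `min_segment` parameter, the counts and the start date are all invented for illustration.

```python
from datetime import date, timedelta


def best_split(counts, min_segment=3):
    """Find the index that best splits counts into two constant-rate segments.

    Returns (index, mean_before, mean_after), where index is the first day
    of the second segment, or None if the series is too short.
    """
    n = len(counts)
    if n < 2 * min_segment:
        return None
    best = None
    for i in range(min_segment, n - min_segment + 1):
        before, after = counts[:i], counts[i:]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        # Squared error of describing each segment by its own mean.
        cost = (sum((c - mean_before) ** 2 for c in before)
                + sum((c - mean_after) ** 2 for c in after))
        if best is None or cost < best[0]:
            best = (cost, i, mean_before, mean_after)
    _, index, mean_before, mean_after = best
    return index, mean_before, mean_after


# Hypothetical daily orange counts for one bug, starting on a made-up date.
start = date(2000, 1, 1)
daily_counts = [1, 0, 2, 1, 1, 0, 1, 5, 6, 4, 7, 5]
index, mean_before, mean_after = best_split(daily_counts)
print("rate changed around", start + timedelta(days=index),
      "from %.2f/day to %.2f/day" % (mean_before, mean_after))
```

The same function could in principle be fed counts aggregated over a grouping (for example one test suite on one platform) instead of a single bug, and the resulting date mapped to a revision range. Note that it always returns the best split, even when no real change exists, so a significance check would still be needed.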
This is neat.
If you could please upload a screencast of what you have so far (even if it is rough), or the code, we will be better able to give feedback. I won't be attending the calls.
Alternatively, it would be great if the call could be recorded.
↑ This is in relation to James's question (the graph that disappeared during our meeting).

The machine picked up a change on Oct-1 06:00, so I went back and looked at the past month.
This is the condensed graph showing the frequency & 7-day running average.
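
For reference, a 7-day running average like the one in the graph can be computed along these lines. This is a minimal sketch assuming a trailing window over daily counts; the function name `running_average` and the treatment of the first few days (averaging over however many days are available) are my own choices, not necessarily what the graph uses.

```python
def running_average(counts, window=7):
    """Trailing running average of daily counts.

    For the first window - 1 days the average is taken over the days
    seen so far, so the output has the same length as the input.
    """
    averages = []
    total = 0
    for i, count in enumerate(counts):
        total += count
        if i >= window:
            # Drop the day that just fell out of the window.
            total -= counts[i - window]
        averages.append(total / min(i + 1, window))
    return averages


# Hypothetical daily failure counts.
print(running_average([0, 1, 0, 2, 1, 0, 1, 6, 5, 7]))
```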