Today's daily/weekly OrangeFactor Robot bug comments look like:

  NN automation job failures were associated with this bug (yesterday|in the last 7 days).
  Repository breakdown: ...
  Platform breakdown: ...
  For more details, see: <link>

If I see "10 automation job failures were associated with this bug yesterday" one day and "30 automation job failures were associated with this bug yesterday" on another day, one interpretation is "this bug just got 3x worse/more frequent". Another possibility is that there were 3x as many pushes that day (consider weekends and regional holidays, obviously, but also tree closures, company meetings, etc.).

In addition, it is currently unclear whether NN failures in a day (or week) is frequent or not. A particular concern might be: if I push to try, how many failures should I expect in 10 retries?
Created attachment 8807348 [details] [diff] [review]
add push count and orangefactor to comments

This adds one new line to each comment. For example:

  # Bug 1285173: 19 automation job failures were associated with this bug yesterday. (19 failures in 119 pushes, or 0.16 failures per push.)

  Repository breakdown:
  * autoland: 14
  * mozilla-aurora: 2
  * try: 1
  * mozilla-release: 1
  * mozilla-central: 1

  Platform breakdown:
  * linux64: 7
  * osx-10-10: 6
  * windows7-32-vm: 5
  * windows8-64: 1

  For more details, see: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1285173&startday=2016-11-02&endday=2016-11-02&tree=all
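The new line is a simple rate computation. A minimal sketch of how it could be rendered follows; the function name is illustrative and not the actual OrangeFactor Robot code:

```python
# Hypothetical helper sketching the parenthetical rate line added to each
# bug comment; not the code from attachment 8807348.
def failure_rate_note(failures, pushes):
    """Format '(N failures in M pushes, or R failures per push.)'."""
    rate = failures / pushes
    return (f"({failures} failures in {pushes} pushes, "
            f"or {rate:.2f} failures per push.)")

print(failure_rate_note(19, 119))
# -> (19 failures in 119 pushes, or 0.16 failures per push.)
```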
Another way to look at this: what are the 7-day and 1-day totals for this bug, and what ranking does this bug have? If we switch to an engineering model where we say "thou shalt not have any intermittent test with >XX instances in a 7 day window", it would be nice to have the ability to see in the Bugzilla comment: "This bug exceeds our acceptable weekly limit; now is the time to increase the priority of this bug." Alternatively, if we have a secondary rule that says "any bug over the threshold for >14 days will be disabled automatically", then we could have OrangeFactor query that for us and comment in the bug so we are all aware that we need to disable the test. I am looking forward to the original data mentioned, and possibly to playing with additional rules/info.
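The two proposed rules could be sketched as follows. The threshold values are assumptions for illustration (the "XX" in the comment above is deliberately unspecified), and none of these names exist in OrangeFactor:

```python
# Illustrative sketch of the proposed weekly-limit rules; WEEKLY_LIMIT is a
# placeholder for the unspecified ">XX instances" threshold.
WEEKLY_LIMIT = 100        # assumed value for "XX"
DISABLE_AFTER_DAYS = 14   # days over the limit before auto-disabling

def weekly_limit_message(failures_7d, days_over_limit):
    """Return the comment text a rule-checking bot might post, or None."""
    if failures_7d <= WEEKLY_LIMIT:
        return None
    if days_over_limit > DISABLE_AFTER_DAYS:
        return ("Over the weekly limit for more than "
                f"{DISABLE_AFTER_DAYS} days; the test should be disabled.")
    return ("This bug exceeds our acceptable weekly limit; "
            "now is the time to increase the priority of this bug.")
```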
So if I have two intermittent-failure bugs assigned, one which fails in Linux32 opt e10s M(3) which runs every push, and one which fails in Linux32 opt e10s M(4), which runs every 8th push, I should work on the M(3) one because it's "0.125 failures per push" while the M(4) one is "0.042 failures per push" despite failing 25% of the time it runs?
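The distortion :philor describes comes from dividing a per-run failure rate by the scheduling interval. A hedged illustration of the relationship (the figures here are generic and are not meant to reproduce the exact numbers in the comment above, which depend on scheduling details):

```python
# Illustration of how SETA-style scheduling skews per-push rates: a job that
# runs on every Nth push has its per-run failure rate diluted by N.
def per_push_rate(per_run_rate, runs_every_n_pushes):
    return per_run_rate / runs_every_n_pushes

m3 = per_push_rate(0.125, 1)  # runs every push, fails 12.5% of runs
m4 = per_push_rate(0.25, 8)   # runs every 8th push, fails 25% of runs
# m3 > m4 per push, even though m4 fails twice as often per run.
```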
That is an interesting point, :philor. I think that is something we need to figure out. Right now, the data OrangeFactor provides for the #1 orange or the top 50 oranges is calculated as failures/push, so SETA being involved only means our orange factor should be much higher. In the short term this will provide additional data that can help make decisions; ideally we can collect more data to make what OrangeFactor presents even more useful.
My primary motivation for adding push count and a per-bug orange factor is to provide better comparison of failure rates over time within each bug: is this test failing more frequently now than it was yesterday/last week? I want to reduce misunderstandings such as:

- there were half as many failures yesterday as the day before, so maybe I don't need to worry about this bug (forgetting that the trees were closed yesterday);
- there were twice as many failures yesterday as the day before, so maybe I should perform some regression analysis to determine which changeset caused the change in frequency (only to find that the failure rate has not changed significantly).

I am not trying to draw conclusions about the relative importance of one bug to another; I think rankings, like bug 1315275, or jmaher's ideas in comment 2, or even the existing simple failure counts, better address that issue.

SETA, and perhaps other load-reducing mechanisms, certainly complicate the use of push count. I'd much rather provide test run counts -- "(19 failures in 119 runs of this test, or 0.16 failures per test run.)" -- but I don't see how to implement that. Push count seems like the best approximation that is readily available. Remember that failure and push counts in bug comments are totals from across all trees, mitigating SETA effects a little. Also, I am using parentheses around the comment addition in an effort to say, "this is just FYI; the important thing is the simple failure count, above".

I have two concerns:

- that this effort to provide more insight into the meaning of the failure counts will result in new misunderstandings -- particularly :philor's scenario in comment 3;
- that the addition of failure counts -- and perhaps rank and other information, in other bugs -- may complicate and confuse the overall message. I like the simplicity of the current messaging: N1 failures yesterday, N2 failures the day before, ....
I don't want to end up with a daily paragraph of statistics that no one will read. Overall, I think I want to go ahead with this idea. I wonder if we can find better wording or otherwise tweak the concept.
I think we should be able to move ahead with this, and possibly tweak it as we see what it ends up like in practice. How do we avoid misunderstandings and too much wording?

Proposed in comment 0:

  19 automation job failures were associated with this bug yesterday. (19 failures in 119 pushes, or 0.16 failures per push.)

Alternative 1:

  19 failures in 119 pushes (0.16 failures/push) were associated with this bug yesterday.

Alternative 2:

  Yesterday: 19 failures, 119 pushes, 0.16 failures/push

To address the concern of confusion, or of making this irrelevant, could we do something like:

Alternative 3:

  Yesterday: 19 failures (increase from 17), 119 pushes, 0.16 failures/push (increase from 0.14)

^ That might be redundant, as you can see that in the previous comments.

Alternative 4:

  Yesterday: 19 failures (0.16 failures/push)

^ Here I removed total pushes, as that isn't always useful.

To help provide relevancy, maybe we have priority following a pattern:

* Priority 0: >100 failures in a 7 day window (if there are 7 days; otherwise project it)
* Priority 1: 50 < x <= 100 failures in a 7 day window
* Priority 2: 10 < x <= 50 failures in a 7 day window
* Priority 3: <=10 failures in a 7 day window

^ NOTE: 10, 50, 100 are arbitrary. If we choose something like this, then we should discuss what to pick; I would think maybe 10, 30, 100 would be more appropriate.
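The priority pattern above, including the "project it" rule for windows shorter than 7 days, could be sketched like this. The thresholds are the arbitrary placeholders from the comment, and the function is hypothetical:

```python
# Sketch of the proposed priority buckets. Thresholds (10, 50, 100) are the
# placeholder values from the comment, not settled policy.
def failure_priority(failures, days_of_data=7, thresholds=(100, 50, 10)):
    """Bucket a bug's failure count over a (possibly partial) 7-day window."""
    # Project a partial window out to a full 7 days, per the Priority 0 note.
    projected = failures * 7 / days_of_data
    if projected > thresholds[0]:
        return 0
    if projected > thresholds[1]:
        return 1
    if projected > thresholds[2]:
        return 2
    return 3
```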
Thanks much for the alternatives. My favorite is alternative 1:

  19 failures in 119 pushes (0.16 failures/push) were associated with this bug yesterday.

...brief, with no loss of information.
Created attachment 8808417 [details] [diff] [review]
add push count and orangefactor to comments

For example:

  # Bug 1206887: 7 failures in 606 pushes (0.012 failures/push) were associated with this bug in the last 7 days.

  Repository breakdown:
  * autoland: 4
  * mozilla-inbound: 1
  * mozilla-central: 1
  * fx-team: 1

  Platform breakdown:
  * android-4-3-armv7-api15: 7

  For more details, see: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1206887&startday=2016-10-31&endday=2016-11-06&tree=all
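A possible rendering of the chosen "alternative 1" wording, matching the example's three-decimal rate. This is a sketch under assumed names, not the actual patch in attachment 8808417:

```python
# Hypothetical formatter for the "alternative 1" summary line; not the
# code from the attachment.
def summary_line(failures, pushes, window="in the last 7 days"):
    rate = failures / pushes
    return (f"{failures} failures in {pushes} pushes "
            f"({rate:.3f} failures/push) "
            f"were associated with this bug {window}.")

print(summary_line(7, 606))
```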
Comment on attachment 8808417 [details] [diff] [review]
add push count and orangefactor to comments

Review of attachment 8808417 [details] [diff] [review]:
-----------------------------------------------------------------

Not tested, but looks fine to me :-) Thank you for doing this!
Attachment #8808417 - Flags: review?(emorley) → review+
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED