Open Bug 1473106 Opened 6 years ago Updated 1 year ago
make adjustments to sheriffing/intermittent policies/process to account for modern trends and tools
We have a long history of triaging intermittent failures. The problem isn't necessarily getting better, but we are good at managing it now. It is clear that our current policies do not necessarily make things better, but they do reduce the pain significantly and keep it from getting worse. The landscape today is:
* new failures show up on Treeherder and we file a bug right away
* we continue starring failures against existing bugs
* if a bug is failing frequently (fresh oranges) we investigate it by retriggering and finding a root cause
* if a bug occurs >=30 times in a 7-day window we add a stockwell tag and needinfo the triage owner for the Bugzilla component
* if a stockwell bug is inactive for 7 days, we reassess and needinfo again
* if a stockwell bug has <30 failures in a week it becomes [stockwell unknown]
* if a stockwell bug has >=200 failures in 30 days, we disable the test

In bug 1473099 we are going to experiment with no triage; the purpose is to determine the value we get from triage. We do not get much pushback or concern about the 200-failures-in-30-days rule; that seems acceptable and a good balance.

We now have test-verify, and when bug 1465117 is fixed we can consider moving test-verify to tier-1. In doing this we will be able to easily run test-verify for all new intermittent failures and help determine whether a failure is related to the push by running the failing test on both the current push and the previous push (test-verify-backfill: TV-bf).
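For illustration, the threshold-based policy above could be sketched roughly as follows. This is a hypothetical outline, not actual sheriffing tooling; the function and return strings are made up, and the ordering of the checks when several conditions apply is my own assumption.

```python
# Hypothetical sketch of the triage thresholds described above.
# Ordering of checks when multiple conditions apply is an assumption.

def classify_bug(failures_last_7_days, failures_last_30_days,
                 days_inactive, is_stockwell):
    """Return the next triage action for an intermittent-failure bug."""
    if failures_last_30_days >= 200:
        return "disable test"
    if failures_last_7_days >= 30:
        return "tag [stockwell] and needinfo triage owner"
    if is_stockwell:
        if days_inactive >= 7:
            return "reassess and needinfo again"
        return "mark [stockwell unknown]"  # <30 failures in a week
    return "keep starring failures to the bug"
```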
The logic would look like:
* Pass/Pass - intermittent test unrelated to the push; actions: ignore or retrigger more
* Fail/Pass - failure is related to the current push; actions: consider backing out
* Fail/Fail - failure is real and not related to the push; actions: file intermittent bug
* Pass/Fail - this shouldn't happen; actions: file a bug to document the odd case and retrigger to see if it is consistent

I would propose that after 30 days of data with test-verify as tier-1 and using TV-bf, we revisit the theoretical actions listed above to adjust policy. Likewise, by that time bug 1473099 will have enough data that we can propose changes to triage and the related whiteboard tags. The ultimate goal is to reduce the number of bugs we file that are most likely never going to reproduce, and to find root causes/back out faster when tests become intermittent, all while spending less time and fewer resources to find the answer.
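The decision table above can be sketched as a small lookup; this is only an illustration of the proposed logic (the function name and action strings are hypothetical), where the first result is the current push and the second is the previous push:

```python
# Hypothetical sketch of the TV-bf decision table; names are illustrative.

def tvbf_action(current_push_result, previous_push_result):
    """Map test-verify results on (current, previous) push to an action."""
    table = {
        ("pass", "pass"): "unrelated intermittent: ignore or retrigger more",
        ("fail", "pass"): "related to current push: consider backing out",
        ("fail", "fail"): "pre-existing failure: file intermittent bug",
        ("pass", "fail"): "unexpected: file bug and retrigger",
    }
    return table[(current_push_result, previous_push_result)]
```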
:aryx, this is a proposal which I would like to pursue in Q3. Can you help me determine what looks good and what doesn't? Do you have other thoughts related to intermittent tests and outcomes?
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #0)
> The ultimate goal is to reduce bugs that we file which are most likely not
> going to reproduce and to find root causes/backout faster when tests become
> intermittent; all while spending less time and resources to find the answer.

How will the low-frequency failures be tracked? If they are not tracked, test-verify would have to run for every occurrence (and wouldn't a pool of untriaged low-frequency failures grow and require more testing automation resources than today?). Someone should keep an eye on which tests get disabled, e.g. some could be affected by infra issues or wrong classifications. The other parts of that workflow look fine.
recently we ran an experiment with test-verify backfill and it didn't give us the results we wanted (pass/pass scenarios may have resulted in higher-frequency intermittents, although the pass/fail and fail/fail scenarios seemed like a strong signal). this leaves us with some action items to address with test-verify-backfill, and maybe more experimentation, before we can change when/how we file new bugs.

This leaves the question: will we track the low-frequency failures that happen <5 times/week? Right now bugs track those, and many of them show 1-2 failures/week and maybe that is it for a 4-week period. Many developers immediately say they care about that data because it could be a security bug or something more serious. While that is true, we have no confidence that the 200+ new intermittent bugs filed each week are actually looked at. I would rather we file bugs that match certain criteria than one for every random intermittent. Ultimately we should be tracking these failures in a database and filing bugs when certain criteria are hit (5+ instances in 14 days, crash stacks unrelated to timeouts, certain leak patterns, etc.).

I think we can come up with an ideal scenario and a plan for the "long tail" of really random bugs while work is being done to improve test-verify and the related backfill and detection tools. :aryx, do you have a proposal or ideas for how to deal with the "long tail" while avoiding a new bug for each item?
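The criteria-based filing idea could be sketched like this. This is a hypothetical illustration only, assuming each tracked failure record carries a timestamp and flags for crash/leak signatures; the function, field names, and exact criteria beyond the ones stated above are assumptions.

```python
# Hypothetical sketch of criteria-based bug filing from a failure database.
# Field names and structure are illustrative assumptions.

from datetime import datetime, timedelta

def should_file_bug(failures, now=None):
    """Decide whether a tracked intermittent failure warrants a bug.

    `failures` is a list of dicts like:
      {"time": datetime, "crash_unrelated_to_timeout": bool,
       "known_leak_pattern": bool}
    """
    now = now or datetime.utcnow()
    recent = [f for f in failures if now - f["time"] <= timedelta(days=14)]
    if len(recent) >= 5:
        return True  # 5+ instances in 14 days
    if any(f["crash_unrelated_to_timeout"] for f in recent):
        return True  # crash stack unrelated to timeouts
    if any(f["known_leak_pattern"] for f in recent):
        return True  # certain leak patterns
    return False  # long tail: keep tracking in the database, no bug yet
```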