Open Bug 1337977 Opened 5 years ago Updated 4 years ago
Display failure frequency in Treeherder
ActiveData has accurate pass/fail data for tests: A query can determine failure and run counts for a given test over a given time period. Reporting that information in Treeherder in proximity to a failed test would provide insight into whether a failure (or sequence of failures) is expected.

I'm thinking that, on clicking (and/or perhaps hovering over) a failed job, TH could say "testABC failed 0 of 100 most recent runs" and that might provoke a very different reaction from an interested developer/sheriff than "testABC failed 87 of 100 most recent runs".

For bonus points, consider color-coding of failures based on this information: bright orange perhaps for failures that are unexpected based on historical trends, to encourage extra attention...a nudge to sheriffs to look for a patch to backout, or a nudge to developers looking at try to rework their patch before check-in.

Would it be inappropriate/inadvisable to rely on ActiveData in Treeherder? Might there be load concerns for ActiveData?

If integration with Treeherder is problematic, perhaps we could consider alternatives like a mach command to examine a push and cross-reference failures in the push with failure frequencies from ActiveData.

(I'm glossing over some issues here. I don't have all the specifics worked out, but I think there's some value in this idea...worth discussing/exploring.)
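A minimal sketch of the summary line and the "unexpected failure" check described above. It assumes the run outcomes have already been fetched as a list of booleans; the function names, the input shape, and the 5% threshold are hypothetical and are not part of any ActiveData or Treeherder API.

```python
def failure_summary(test_name, outcomes, window=100):
    """Summarize failures over the most recent `window` runs.

    `outcomes` is a hypothetical list of booleans, oldest first,
    where True means the run failed.
    """
    recent = outcomes[-window:]
    failed = sum(recent)
    return f"{test_name} failed {failed} of {len(recent)} most recent runs"


def is_unexpected(outcomes, window=100, threshold=0.05):
    """Flag a failure as unexpected when the historical failure rate is low.

    The 5% threshold is an arbitrary placeholder, not a tuned value.
    """
    recent = outcomes[-window:]
    if not recent:
        return True  # no history at all: treat the failure as unexpected
    return sum(recent) / len(recent) < threshold
```

A UI could use `is_unexpected` to decide whether a failure gets the brighter color.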
This is a great idea, and if we are careful we could implement it outside the scope of auto classification. In many cases a developer/sheriff sees an orange, clicks on it, and if bug suggestions are displayed, assumes all is OK. They get bonus points if they look at all the failures and ensure they have matching bug titles.

The existing holes that I see in our current solution are:
1) human doesn't verify that the whole failure matches what is in the bug title
2) human doesn't look at all the failures in the summary
3) human doesn't verify that the failure for the configuration they see is already known

I think the big win with this proposal is #3. If we only showed bug suggestions for actual failures of that specific test case which occur on that configuration, then we could effectively highlight the failures that are more important for people to look at.

For the case of try server, a developer would only be alerted to failures which have not been seen in the last 30 days for that test and configuration. There could be a note indicating that this specific test case has other known failures in different configurations and that this might be related. The downside is that if a new intermittent shows up on one platform (say win7/win8) and then shows up the next day on linux32/64, we have additional investigations to do. I don't think that is really a downside; it is more a matter of ensuring we are looking at outliers.

I also see a few interesting patterns we could derive from here:
1) failures/config for 48 hours, 7 days, 14 days, 30 days: we could build rules based on the different rates here
2) suggesting failures for common configs (debug*, linux opt *, etc.)
3) a collapsed list of low priority failures (i.e. ones we find matching in ActiveData) could just say "14 already known failures exist, most likely not your problem".

There are a lot of implementation details. I wonder if we could mock this up with examples, maybe with a small tool that pulls all data for a given revision from Treeherder and cross-references the failures with ActiveData. This could be a mach command or a simple web experiment.
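As a rough illustration of the failures/config idea over several time windows, here is a sketch that counts failures for one test on one configuration. The record layout (`test`, `config`, `time`, `ok`) and the function name are assumptions made for this example, not the ActiveData schema.

```python
from datetime import datetime, timedelta

# Time windows mentioned in the comment above.
WINDOWS = {
    "48h": timedelta(hours=48),
    "7d": timedelta(days=7),
    "14d": timedelta(days=14),
    "30d": timedelta(days=30),
}


def failure_rates(runs, test, config, now):
    """Return {window_label: (failures, total)} for one test on one config.

    `runs` is a hypothetical list of dicts with keys
    "test", "config", "time" (datetime) and "ok" (bool).
    """
    rates = {}
    for label, span in WINDOWS.items():
        selected = [
            r for r in runs
            if r["test"] == test
            and r["config"] == config
            and now - span <= r["time"] <= now
        ]
        failures = sum(1 for r in selected if not r["ok"])
        rates[label] = (failures, len(selected))
    return rates
```

Rules could then compare windows, for example treating a test with failures in the 48 hour window but none in the 30 day window before that as a candidate new intermittent.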
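To make the cross-referencing idea a bit more concrete, here is a small sketch of the core step only. It assumes the failures for a push and the historically known failures have already been fetched as (test, config) pairs; how they are fetched from Treeherder or ActiveData is left out, and all names here are hypothetical.

```python
def cross_reference(push_failures, known_failures):
    """Split a push's failures into new, known, and known-elsewhere.

    `push_failures` is a list of (test, config) pairs from one push.
    `known_failures` is a set of (test, config) pairs seen failing in
    the historical data (for example, over the last 30 days).
    """
    known_tests = {test for test, _ in known_failures}
    new, known, known_elsewhere = [], [], []
    for test, config in push_failures:
        if (test, config) in known_failures:
            known.append((test, config))  # same test, same config
        elif test in known_tests:
            known_elsewhere.append((test, config))  # same test, other config
        else:
            new.append((test, config))  # never seen failing
    return new, known, known_elsewhere


def collapsed_note(known):
    """One-line summary for the collapsed list of low priority failures."""
    return f"{len(known)} already known failures exist, most likely not your problem"
```

The `known_elsewhere` bucket corresponds to the note about a test having known failures in different configurations.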
> I'm thinking that, on clicking (and/or perhaps hovering over) a failed job, TH
> could say "testABC failed 0 of 100 most recent runs" and that might provoke a
> very different reaction from an interested developer/sheriff than "testABC
> failed 87 of 100 most recent runs".

There are a couple of hurdles, I think:
1) TH does not know the test that failed: We have bugs and text logs, but nothing like a test GUID for the two systems to communicate (TH is improving faster than I can keep up, so maybe this is wrong)
2) TH is job focused, while ActiveData is test focused: TH covers failures that have no test to blame, or have multiple tests to blame.

> Would it be inappropriate/inadvisable to rely on ActiveData in Treeherder?

Currently, ActiveData's priority is on providing "low" latency indexes on "everything" over "long" time periods at the expense of data timeliness and uptime. These priorities were chosen for cost reduction and personal prioritization. This is in stark contrast to TH's priorities, resulting in an impedance mismatch between the two systems. Anything connecting to ActiveData must have a strategy for dealing with latency, for which there are many solutions.

> Might there be load concerns for ActiveData?

Some days ActiveData serves 2 million requests, the vast majority from automation similar to what you are proposing. I do not believe there is a load concern; latency is the biggest concern.

> If integration with Treeherder is problematic, perhaps we could consider
> alternatives

There is the TestFailures dashboard: a prototype to show a test-centric view. It needs a lot of work, but it is good for showing what is failing, and what platform it is failing on. This type of dashboard ignores the job failures that TH shows, so it is missing all the failures that cannot be blamed on tests.

I would like to look into job failures that cannot be blamed on test failures: It would be interesting for me to better understand how many there are, and what their character is.
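On the point above about needing a strategy for latency, one option among many is to cache results on the consumer side and fall back to stale data when a query fails or times out. The sketch below is only an illustration of that pattern; the class name and the `fetch` callable are hypothetical and do not describe how Treeherder or ActiveData actually work.

```python
import time


class StaleOkCache:
    """Serve cached results when the backing query is slow or unavailable.

    `fetch` is any callable taking a key and returning a value; it is
    expected to raise an exception (for example TimeoutError) on failure.
    """

    def __init__(self, fetch, max_age_seconds=3600, clock=time.monotonic):
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # key -> (timestamp, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._max_age:
            return entry[1]  # fresh enough: skip the query entirely
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # query failed: fall back to stale data
            return None  # nothing cached: caller shows no frequency
        self._entries[key] = (self._clock(), value)
        return value
```

With this shape, a slow or unavailable backend degrades the display (older counts, or none) rather than blocking the page.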
 "Low" latency is defined by 1 second per billion records required to answer your query, or less. I used quotes because I am painfully aware my priorities may not be impacting observed reality.  TestFailures dashboard: https://people-mozilla.org/~klahnakoski/testfailures/  https://docs.google.com/document/d/1LB4Ppj55rw9IFC-ggQr6Vy0yx8Gq-c6rcD7aNt_0Ib4/edit