intermittent failure view graph should show failures/runs instead of failures/pushes
Categories
(Tree Management :: Treeherder: Frontend, task)
Tracking
(Not tracked)
People
(Reporter: jmaher, Unassigned)
References
(Blocks 2 open bugs)
Details
This is a major flaw in the graph, but hard to fix. Right now we calculate the total number of pushes for a given day and compute:
orange_factor = total_test_failures / total_pushes
This is misleading because we do not run every test on every push. If a test only runs on limited hardware or only on mozilla-central, it may run <10x/day while we record >100 pushes/day.
Instead we should use:
orange_factor = total_test_failures / total_test_runs
This is difficult to compute because a test running in chunk 1 on one push could run in chunk 5 on the next, depending on the chunk balancing algorithm. We also run many build types (e.g. opt, debug, asan) and many test variants. Ideally we would calculate the orange factor for each unique platform/buildtype/variant, aggregate that into a per-bug orange factor, and then into an overall orange factor. This would allow drilling into a specific bug to see more details, and likewise filtering on a build type or test variant to see failure patterns.
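As a minimal sketch of that aggregation (field names such as "bug_id", "platform", "build_type", and "variant" are illustrative assumptions, not an existing Treeherder schema): compute the orange factor per config first, then roll it up per bug.

from collections import defaultdict

def orange_factors(failures, runs):
    # failures/runs: lists of dicts with the hypothetical fields noted above
    run_counts = defaultdict(int)
    for run in runs:
        run_counts[(run["platform"], run["build_type"], run["variant"])] += 1

    fail_counts = defaultdict(int)
    for f in failures:
        fail_counts[(f["bug_id"], f["platform"], f["build_type"], f["variant"])] += 1

    # orange factor per (bug, platform/buildtype/variant)
    per_config = {
        key: count / run_counts[key[1:]]
        for key, count in fail_counts.items()
        if run_counts[key[1:]] > 0
    }

    # per-bug orange factor: average over the configs the bug failed on
    per_bug = defaultdict(list)
    for (bug_id, *_config), rate in per_config.items():
        per_bug[bug_id].append(rate)
    return {bug: sum(rates) / len(rates) for bug, rates in per_bug.items()}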
Right now we do collect the test run information as an artifact on the test-info-all task that runs on mozilla-central. Unfortunately this has issues: it gets the run-count data from Treeherder (e.g. https://treeherder.mozilla.org/api/groupsummary/?startdate=2024-10-06&enddate=2024-10-07 ), which is currently returning permanent 502 errors. This is basically pulling data from:
https://github.com/mozilla/treeherder/blob/57537ac0cb2f8b73f0acfd0bc49591f5a13fe9de/treeherder/webapp/api/groups.py#L17
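For reference, a minimal sketch of how that endpoint could be queried (the endpoint and parameters come from the URL above; the response shape and error handling are assumptions):

import requests

def fetch_group_summary(startdate, enddate):
    # e.g. fetch_group_summary("2024-10-06", "2024-10-07")
    url = "https://treeherder.mozilla.org/api/groupsummary/"
    resp = requests.get(url, params={"startdate": startdate, "enddate": enddate}, timeout=60)
    resp.raise_for_status()  # the permanent 502s described above would surface here
    return resp.json()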
The alternative is to cache this summary information in the Treeherder database instead of pulling it out into an artifact and exposing it as an API. Going this route is more work for Treeherder, but might end up being cleaner. Some pros/cons of this approach:
pros:
- no need to proxy information to another task, creating more artifacts, etc.
- removes an API endpoint that can put a lot of load on the TH DB if scanned
- data is queried from the source
cons:
- restricted to 4 months of data (or whatever we have in TH) - it would be nice to have a full 12 months of data
- would need to make this data much faster to query (maybe some intermediate tables, etc.), or restructure in general how we store this data
- data is only available to internal treeherder (unless we expose via another new API)
I think the pros outweigh the cons. If we end up with extra tables or a restructured data layout, then once we mirror this to a longer-term read-only database we could theoretically query from there instead of the hot TH DB.
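If we do add intermediate tables, a rough sketch of what one might look like (a hypothetical Django model; the name, fields, and sizes are assumptions, not an existing Treeherder table). One pre-aggregated row per day/config/group would keep the summary query cheap instead of scanning the hot job tables:

from django.db import models

class GroupDailySummary(models.Model):
    # hypothetical pre-aggregated summary row; not part of Treeherder today
    date = models.DateField(db_index=True)
    platform = models.CharField(max_length=100)
    build_type = models.CharField(max_length=50)
    variant = models.CharField(max_length=100)
    group_name = models.CharField(max_length=255)
    run_count = models.PositiveIntegerField()
    failure_count = models.PositiveIntegerField()

    class Meta:
        unique_together = ("date", "platform", "build_type", "variant", "group_name")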
Reporter | Comment 1•26 days ago
In general the calculation of runs follows these steps (sketched in code after the list):
- summarize all instances of a test_group (a.k.a. test manifest) that ran
- sort by unique test run (platform, buildtype, variant; ignore chunk)
- ASSUME all tests ran (edge case here is a crash can cause us to terminate running tests after the crash)
- potentially look at failures and filter out "crash" from the total
- for a failure on a given test run: count(failures_matching_testrun) / count(total_tasks_testgroup_testrun)
- this will generate between 1 and 100 different datapoints for a given failure, 1 for each testrun
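A minimal sketch of that calculation, assuming each task and each failure record carries platform/build_type/variant fields (names are illustrative) and a status field for the crash filtering (also an assumption):

from collections import defaultdict

def failure_rate_per_testrun(tasks, failures):
    def key(d):
        return (d["platform"], d["build_type"], d["variant"])  # ignore chunk

    task_counts = defaultdict(int)
    for t in tasks:
        task_counts[key(t)] += 1  # ASSUME every test in the group ran in each task

    failure_counts = defaultdict(int)
    for f in failures:
        if f.get("status") == "crash":
            continue  # optionally filter out crashes, which end the run early
        failure_counts[key(f)] += 1

    # one datapoint per testrun: count(failures_matching_testrun) / count(total_tasks_testgroup_testrun)
    return {k: failure_counts[k] / n for k, n in task_counts.items() if n > 0}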
What we care to show is which failures are failing most or all of the time; then, of course, if things get fixed it would be nice to see that line go "down".
A simple way is to do something like: count(failures) / count(total_tasks_testgroup) - effectively ignoring the testrun.
A more complex way would be:
- sum(failure_rate_testrun) / count(failure_rate_testrun)
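Both options can be expressed directly in terms of the per-testrun rates (building on the failure_rate_per_testrun sketch above; still an illustration, not existing code):

def simple_rate(total_failures, total_tasks_testgroup):
    # count(failures) / count(total_tasks_testgroup), ignoring the testrun split
    return total_failures / total_tasks_testgroup

def averaged_rate(rates_by_testrun):
    # sum(failure_rate_testrun) / count(failure_rate_testrun)
    rates = list(rates_by_testrun.values())
    return sum(rates) / len(rates)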
Possibly a more interesting view would be extra lines: keep the default line as a total_failures/total_tasks type of approach, but add these additional lines (a bucketing sketch follows the list):
- configs with > 20% failure rate
- configs with < 20% but > 0% failure rate
- configs with 0 failures!
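A small sketch of that bucketing for a single bug on a single day, again using the per-testrun rates from above (the thresholds and bucket names are just the ones suggested in the list):

def bucket_configs(rates_by_testrun):
    buckets = {"over_20pct": 0, "under_20pct": 0, "zero_failures": 0}
    for rate in rates_by_testrun.values():
        if rate > 0.20:
            buckets["over_20pct"] += 1
        elif rate > 0:
            buckets["under_20pct"] += 1
        else:
            buckets["zero_failures"] += 1
    return buckets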
If this could be generated per day, then one could graph it. Of course this would be a lot of data points, so showing it for a single bug would be key.
There is an edge case for infra or non-manifest-specific failures. These will need to be accounted for differently: maybe we can filter them out and just show a count/day? Alternatively, one could look at the platform/build_type/harness the failure shows up on and count the tasks for that. I would ignore test variants in this equation.
If we wanted to summarize over all bugs, we would need to handle the edge case of non-test-specific failures (infra, task level, etc.); we could then summarize like we do across all testruns, just adding sum(failures) / count(all_related_tasks) for a given day.
Reporter | Comment 2•19 days ago
As a note, the groupsummary/ endpoint for Treeherder is working again and the original data source (test-info-all.json) has all the information again.