Closed Bug 777436 Opened 12 years ago Closed 12 years ago

Ascertain which Android test suites have >30% failure rate and hide them

Categories

(Tree Management Graveyard :: TBPL, defect)

ARM
Android
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

(Whiteboard: [sheriff-want])

Attachments

(2 files)

The current overall Android test failure rate is extremely high, leading to:
* People not taking *any* Android test failures seriously, real or otherwise.
* Many retriggers, increasing load dramatically, which is a contributor to bug 772458 (and combined with things like bug 777273 the situation becomes dire).
* Sheriffs starring all Android failures generically with 'a' (as we have been for months), since there are simply too many failures to open each log and star it with the correct bug across 10+ trees. (Even with generic starring, Android test failures eat up a significant proportion of my day.)

People like gbrown, jmaher, Callek, armeng (and many more) are working hard at getting the failures (both infra/hardware and test specific) resolved (for which I'm exceptionally grateful); however, we cannot wait any longer. As it is, it will be weeks, if not months, before we undo the conditioning of all devs to just ignore all Android test failures - even once the tests are routinely green.

Many of the failures are in only a handful of the test suites - so in the short term, we should just hide the worst behaving suites (those with over 30% failure rate). 

This will:
* stop people being tempted to retrigger the unreliable test suites, reducing tegra load.
* improve the overall perception of the reliability of Android tests - starting the slow journey back to people trusting them.
* mean that the tests can still be viewed by adding &noignore=1 to the TBPL URL - it's not like we're disabling them and losing coverage.
Native Android:

from: 5bd9db1381a6 (bbondy@moco – Sat Jul 21 18:58:07 2012 UTC+1)
to: fe77957061ea (jmathies@moco – Wed Jul 25 10:28:32 2012 UTC+1)

* mochitest-1: 28 failing, 73 total (38%)
* mochitest-2: 17 failing, 67 total (25%)
* mochitest-3: 27 failing, 70 total (39%)
* mochitest-4: 7 failing, 64 total (11%)
* mochitest-5: 7 failing, 65 total (11%)
* mochitest-6: 9 failing, 71 total (13%)
* mochitest-7: 12 failing, 69 total (17%)
* mochitest-8: 28 failing, 71 total (39%)
* robocop: 50 failing, 88 total (57%)
* crashtest-2: 7 failing, 63 total (11%)
* crashtest-3: 3 failing, 64 total (5%)
* jsreftest-1: 11 failing, 72 total (15%)
* jsreftest-2: 10 failing, 66 total (15%)
* jsreftest-3: 4 failing, 61 total (7%)
* reftest-1: 9 failing, 67 total (13%)
* reftest-2: 41 failing, 81 total (51%)
* reftest-3: 46 failing, 86 total (53%)
* remote-tdhtml: 3 failing, 59 total (5%)
* remote-trobocheck: 24 failing, 66 total (36%)
* remote-trobocheck2: 18 failing, 64 total (28%)
* remote-trobocheck3: 32 failing, 78 total (41%)
* remote-trobopan: 26 failing, 66 total (39%)
* remote-troboprovider: 20 failing, 71 total (28%)
* remote-tsvg: 8 failing, 60 total (13%)
* remote-tp4m_nochrome: 5 failing, 63 total (8%)
* remote-ts: 45 failing, 86 total (52%)
Meant to add: the figures in comment 1 and here exclude the blue retries, but include all other failures.
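
As a rough illustration of that counting rule, here is a minimal Python sketch that drops retried runs and treats every other non-passing result as a failure. The result labels ("success", "testfailed", "exception", "retry") and the split of the sample data are assumptions made for the example; they are not taken from the TBPL patch attached to this bug.

```python
# Hypothetical result labels; the real TBPL state names may differ.
RETRY = "retry"
PASSING = {"success"}


def summarize(results):
    """Return (failing, total, percent) for one suite, ignoring retried runs."""
    counted = [r for r in results if r != RETRY]
    total = len(counted)
    failing = sum(1 for r in counted if r not in PASSING)
    percent = round(100 * failing / total) if total else 0
    return failing, total, percent


def format_line(failing, total, percent):
    """Format one suite's counts as 'N failing, M total (P%)'."""
    return f"{failing:3d} failing, {total} total ({percent:2d}%)"


# Illustrative data only: 73 counted runs (28 of them non-passing) plus 5 retries.
runs = ["success"] * 45 + ["testfailed"] * 20 + ["exception"] * 8 + ["retry"] * 5
print(format_line(*summarize(runs)))
```

The five retried runs do not appear in either the failing count or the total, which is the behaviour described for the blue retries.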

XUL Android:
(Same timeframe as comment 1)

* mochitest-1: 6 failing, 66 total (9%)
* mochitest-2: 6 failing, 69 total (9%)
* mochitest-3: 2 failing, 64 total (3%)
* mochitest-4: 6 failing, 63 total (10%)
* mochitest-5: 2 failing, 66 total (3%)
* mochitest-6: 4 failing, 63 total (6%)
* mochitest-7: 2 failing, 63 total (3%)
* mochitest-8: 10 failing, 67 total (15%)
* crashtest-2: 4 failing, 63 total (6%)
* crashtest-3: 4 failing, 62 total (6%)
* jsreftest-1: 16 failing, 67 total (24%)
* jsreftest-2: 15 failing, 68 total (22%)
* jsreftest-3: 3 failing, 59 total (5%)
* reftest-1: 2 failing, 63 total (3%)
* reftest-2: 3 failing, 62 total (5%)
* reftest-3: 9 failing, 65 total (14%)
Setting a threshold of 30% for native Android, and 20% for XUL (given that we only need it to ensure the metro/B2G stuff still works), leaves the following:

Native:
* mochitest-1          38%
* mochitest-3          39%
* mochitest-8          39%
* reftest-2            51%
* reftest-3            53%
* robocop              57%
* remote-trobocheck    36%
* remote-trobocheck3   41%
* remote-trobopan      39%
* remote-ts            52%

XUL:
* jsreftest-1          24%
* jsreftest-2          22%

The above have been hidden on:
* mozilla-central
* mozilla-inbound
* fx-team
* services-central
* try

I have left out project repos, since many are not running Android tests and sheriffs don't have to star them, so it makes little difference.

Will file dependants for unhiding each, next.
Whiteboard: [sheriff-want]
I've just added "Show all Android tests" links to the TBPL status messages for each of the trees in comment 3, linking to " ...&jobname=Android&noignore=1" (similar to how we've done it for Spidermonkey builds on inbound for some time).
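
A small Python sketch of how such a link could be built from an existing TBPL URL is shown here. Only the jobname=Android and noignore=1 parameters come from this bug; the host and tree name in the usage line are placeholders, and the helper name is invented for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def show_all_android(url):
    """Return url with jobname=Android and noignore=1 set in its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({"jobname": "Android", "noignore": "1"})
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL for illustration only; not a real TBPL address.
print(show_all_android("https://tbpl.example.org/?tree=SomeTree"))
```

Because the parameters are set rather than appended, applying the helper to a URL that already carries them leaves it unchanged.
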
(In reply to Ed Morley [:edmorley] from comment #0)
> * improve the overall perception of the reliability of Android tests -
> starting the slow journey back to people trusting them.

I think this bug is a good change, but I did the math on the percentages in comment 1. Even with everything with a 30% failure rate or higher discarded, there's a 92% chance of a bogus failure in at least one of the remaining tests. So perhaps this could help perception somewhat, but I'd hold off on any announcements that everything is better to avoid a "never cry wolf" effect. (Never cry no wolf?)
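
The calculation behind that figure isn't shown in the comment, so the following is only a guess at the method: treat each remaining native Android suite from comment 1 as failing independently at its listed rate, and compute the chance that at least one of them fails on a given push. The independence assumption and the choice of the native suites are both assumptions of this sketch.

```python
from math import prod

# Rounded failure rates of the native Android suites below 30% in comment 1.
REMAINING_NATIVE = {
    "mochitest-2": 0.25, "mochitest-4": 0.11, "mochitest-5": 0.11,
    "mochitest-6": 0.13, "mochitest-7": 0.17, "crashtest-2": 0.11,
    "crashtest-3": 0.05, "jsreftest-1": 0.15, "jsreftest-2": 0.15,
    "jsreftest-3": 0.07, "reftest-1": 0.13, "remote-tdhtml": 0.05,
    "remote-trobocheck2": 0.28, "remote-troboprovider": 0.28,
    "remote-tsvg": 0.13, "remote-tp4m_nochrome": 0.08,
}


def chance_of_any_failure(rates):
    """P(at least one failure) = 1 - product of (1 - p_i), assuming independence."""
    return 1 - prod(1 - p for p in rates)


p = chance_of_any_failure(REMAINING_NATIVE.values())
print(f"Chance of at least one failing suite per push: {p:.0%}")
```

Under those assumptions the result comes out at roughly 92%, in line with the number quoted above.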

It does mean that the odds of only needing a single retrigger on a visible failure are much higher.

I suppose automated retriggering would get things to a decent state, but that seems very bad from a load perspective.

Perhaps if a build of a later changeset is green, it could "auto-star" earlier builds? (Just in the tbpl UI, I mean.)
(In reply to Steve Fink [:sfink] (vacation Jul30-Aug10) from comment #5)
> I think this bug is a good change, but I did the math on the percentages in
> comment 1. Even with everything with a 30% failure rate or higher
> discarded, there's a 92% chance of a bogus failure in at least one of the
> remaining tests. So perhaps this could help perception somewhat, but I'd
> hold off on any announcements that everything is better to avoid a "never
> cry wolf" effect. (Never cry no wolf?)

Yeah I wasn't going to shout anything yet :-)
(This bug was as much for my sheriffing sanity as anything else - at least in the short term).

Post bug 775227's bad test disabling, the revised figures (inbound since ~Fri) for the hidden Native mochitests are:

Native:
* mochitest-1: 21 failing, 81 total (26%)
* mochitest-3: 76 failing, 78 total (97%)
* mochitest-8: 15 failing, 82 total (18%)
* robocop    : 39 failing, 81 total (48%)

I still haven't filed the dependent bugs; will do so after the a-team meeting. Will also try to track down where the m3 failure rate jumped up from the figure in comment 3, and un-hide m8 since it seems much better behaved now :-)
Native Android M8 unhidden on all trees listed in comment 3.
looks like we won't have anything running before too long:)
Depends on: 778952
Depends on: 778954
Depends on: 778956
Depends on: 778958
Depends on: 778960
Depends on: 778961
Depends on: 778962
Depends on: 778963
Depends on: 778964
Depends on: 778965
Depends on: 778967
(In reply to Joel Maher (:jmaher) from comment #8)
> looks like we won't have anything running before too long:)

At least that might mean we get buy-in from platform... :-)

Joking aside, unless something drastic happens, we should only be unhiding them from this point forwards (eg M8 which has already improved enough to unhide).
This patch allows TBPL to show failure stats for the current view. Bit hacky but gets the job done.

To use:
1) Check out http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/
2) Apply patch
3) Run index.html from the local filesystem
4) Adjust filters to whichever Android suite(s) you would like included in the stats
5) Stats displayed top right of the UI, where the unstarred count normally is.

Since this method uses the server side components from prod, the data imports and hidden builders will all match prod TBPL :-)

It also changes the default TBPL refresh rate from 120 secs to 99999, to avoid the extreme jank you get when trying to look at the last several days of runs while it does a "Loading...".
Assignee: nobody → bmo
Status: NEW → ASSIGNED
Depends on: 771626
Depends on: 779871
The last hidden Android {native,XUL} suite was unhidden in bug 778954 -> closing this out :-)
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard