Ascertain which Android test suites have >30% failure rate and hide them

RESOLVED FIXED

Status

Tree Management Graveyard
TBPL
--
major
RESOLVED FIXED
6 years ago
3 years ago

People

(Reporter: emorley, Assigned: emorley)

Tracking

Trunk
ARM
Android

Details

(Whiteboard: [sheriff-want])

Attachments

(2 attachments)

(Assignee)

Description

6 years ago
The current overall Android test failure rate is extremely high, leading to:
* People not taking *any* Android test failures seriously, real or otherwise.
* Many retriggers, increasing load dramatically, which is a contributor to bug 772458 (and combined with things like bug 777273 the situation becomes dire).
* Sheriffs just starring all Android failures generically with 'a' (and have been for months), since there are just too many failures to open each log and star with the correct bug across 10+ trees. (Even starring generically, Android test failures eat up a significant proportion of my day).

People like gbrown, jmaher, Callek, armeng (and many more) are working hard at getting the failures (both infra/hardware and test specific) resolved, for which I'm exceptionally grateful; however, we cannot wait any longer. As it is, it will be weeks, if not months, before we undo the conditioning of all devs to just ignore all Android test failures, even once the tests are routinely green.

Many of the failures are in only a handful of the test suites, so in the short term we should just hide the worst-behaving suites (those with over 30% failure rate).

This will:
* stop people being tempted to retrigger the unreliable test suites, reducing tegra load.
* improve the overall perception of the reliability of Android tests - starting the slow journey back to people trusting them.
* mean that the tests can still be viewed by adding &noignore=1 to the TBPL URL; it's not like we're disabling them and losing coverage.
(Assignee)

Comment 1

6 years ago
Native Android:

from: 5bd9db1381a6 (bbondy@moco – Sat Jul 21 18:58:07 2012 UTC+1)
to: fe77957061ea (jmathies@moco – Wed Jul 25 10:28:32 2012 UTC+1)

~mochitest-1~
 28 failing, 73 total (38%)
~mochitest-2~
 17 failing, 67 total (25%)
~mochitest-3~
 27 failing, 70 total (39%)
~mochitest-4~
  7 failing, 64 total (11%)
~mochitest-5~
  7 failing, 65 total (11%)
~mochitest-6~
  9 failing, 71 total (13%)
~mochitest-7~
 12 failing, 69 total (17%)
~mochitest-8~
 28 failing, 71 total (39%)
~robocop~
 50 failing, 88 total (57%)
~crashtest-2~
  7 failing, 63 total (11%)
~crashtest-3~
  3 failing, 64 total ( 5%)
~jsreftest-1~
 11 failing, 72 total (15%)
~jsreftest-2~
 10 failing, 66 total (15%)
~jsreftest-3~
  4 failing, 61 total ( 7%)
~reftest-1~
  9 failing, 67 total (13%)
~reftest-2~
 41 failing, 81 total (51%)
~reftest-3~
 46 failing, 86 total (53%)
~remote-tdhtml~
  3 failing, 59 total ( 5%)
~remote-trobocheck~
 24 failing, 66 total (36%)
~remote-trobocheck2~
 18 failing, 64 total (28%)
~remote-trobocheck3~
 32 failing, 78 total (41%)
~remote-trobopan~
 26 failing, 66 total (39%)
~remote-troboprovider~
 20 failing, 71 total (28%)
~remote-tsvg~
  8 failing, 60 total (13%)
~remote-tp4m_nochrome~
  5 failing, 63 total ( 8%)
~remote-ts~
 45 failing, 86 total (52%)
(Assignee)

Comment 2

6 years ago
Meant to add: the figures in comment 1 and here exclude the blue retries but include all other failures.
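
As a rough illustration of that counting rule, the sketch below drops retried runs and treats every remaining non-green result as a failure. The result labels ("success", "testfailed", "exception", "retry") and the sample run list are hypothetical, chosen only so that the totals land on the mochitest-1 figures from comment 1; it also assumes that retries are excluded from the total as well as from the failing count, which the comment does not spell out.

```python
from collections import Counter

# Hypothetical result labels; the real TBPL status strings may differ.
RETRY = "retry"
SUCCESS = "success"


def failure_stats(results):
    """Return (failing, total, rate), ignoring retried runs entirely."""
    counts = Counter(results)
    total = sum(counts.values()) - counts[RETRY]  # retries are not counted at all
    failing = total - counts[SUCCESS]             # everything else non-green fails
    rate = failing / total if total else 0.0
    return failing, total, rate


# Made-up sample: the split between failure kinds is illustrative only.
runs = ["success"] * 45 + ["testfailed"] * 25 + ["exception"] * 3 + ["retry"] * 4
failing, total, rate = failure_stats(runs)
print(f"{failing:3d} failing, {total} total ({rate:.0%})")
```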

XUL Android:
(Same timeframe as comment 1)

~mochitest-1~
  6 failing, 66 total ( 9%)
~mochitest-2~
  6 failing, 69 total ( 9%)
~mochitest-3~
  2 failing, 64 total ( 3%)
~mochitest-4~
  6 failing, 63 total (10%)
~mochitest-5~
  2 failing, 66 total ( 3%)
~mochitest-6~
  4 failing, 63 total ( 6%)
~mochitest-7~
  2 failing, 63 total ( 3%)
~mochitest-8~
 10 failing, 67 total (15%)
~crashtest-2~
  4 failing, 63 total ( 6%)
~crashtest-3~
  4 failing, 62 total ( 6%)
~jsreftest-1~
 16 failing, 67 total (24%)
~jsreftest-2~
 15 failing, 68 total (22%)
~jsreftest-3~
  3 failing, 59 total ( 5%)
~reftest-1~
  2 failing, 63 total ( 3%)
~reftest-2~
  3 failing, 62 total ( 5%)
~reftest-3~
  9 failing, 65 total (14%)
(Assignee)

Comment 3

6 years ago
Setting a threshold of 30% for native Android, and 20% for XUL (given that we only need it to ensure the metro/B2G stuff still works), leaves the following:

Native:
* mochitest-1          38%
* mochitest-3          39%
* mochitest-8          39%
* reftest-2            51%
* reftest-3            53%
* robocop              57%
* remote-trobocheck    36%
* remote-trobocheck3   41%
* remote-trobopan      39%
* remote-ts            52%

XUL:
* jsreftest-1          24%
* jsreftest-2          22%
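
For reference, the selection above can be reproduced from the counts in comments 1 and 2 with a short script. This is only a sketch: the dictionary layout and the helper name `suites_over` are illustrative, and "threshold" is taken to mean a failure rate strictly greater than the given fraction.

```python
# (failing, total) per suite, copied from comment 1 (native) and comment 2 (XUL).
NATIVE = {
    "mochitest-1": (28, 73), "mochitest-2": (17, 67), "mochitest-3": (27, 70),
    "mochitest-4": (7, 64), "mochitest-5": (7, 65), "mochitest-6": (9, 71),
    "mochitest-7": (12, 69), "mochitest-8": (28, 71), "robocop": (50, 88),
    "crashtest-2": (7, 63), "crashtest-3": (3, 64),
    "jsreftest-1": (11, 72), "jsreftest-2": (10, 66), "jsreftest-3": (4, 61),
    "reftest-1": (9, 67), "reftest-2": (41, 81), "reftest-3": (46, 86),
    "remote-tdhtml": (3, 59), "remote-trobocheck": (24, 66),
    "remote-trobocheck2": (18, 64), "remote-trobocheck3": (32, 78),
    "remote-trobopan": (26, 66), "remote-troboprovider": (20, 71),
    "remote-tsvg": (8, 60), "remote-tp4m_nochrome": (5, 63),
    "remote-ts": (45, 86),
}
XUL = {
    "mochitest-1": (6, 66), "mochitest-2": (6, 69), "mochitest-3": (2, 64),
    "mochitest-4": (6, 63), "mochitest-5": (2, 66), "mochitest-6": (4, 63),
    "mochitest-7": (2, 63), "mochitest-8": (10, 67),
    "crashtest-2": (4, 63), "crashtest-3": (4, 62),
    "jsreftest-1": (16, 67), "jsreftest-2": (15, 68), "jsreftest-3": (3, 59),
    "reftest-1": (2, 63), "reftest-2": (3, 62), "reftest-3": (9, 65),
}


def suites_over(stats, threshold):
    """Return the sorted names of suites whose failure rate exceeds threshold."""
    return sorted(
        name for name, (failing, total) in stats.items()
        if failing / total > threshold
    )


for label, stats, threshold in (("Native", NATIVE, 0.30), ("XUL", XUL, 0.20)):
    print(f"{label}:")
    for name in suites_over(stats, threshold):
        failing, total = stats[name]
        print(f"* {name:<20} {failing / total:.0%}")
```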

The above have been hidden on:
* mozilla-central
* mozilla-inbound
* fx-team
* services-central
* try

I have left out project repos, since many are not running Android tests and sheriffs don't have to star them, so hiding there would make little difference.

Will file dependent bugs for unhiding each suite next.
(Assignee)

Updated

6 years ago
Whiteboard: [sheriff-want]
(Assignee)

Comment 4

6 years ago
I've just added "Show all Android tests" links to the TBPL status messages for each of the trees in comment 3, linking to " ...&jobname=Android&noignore=1" (similar to how we've done it for Spidermonkey builds on inbound for some time).
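
A minimal sketch of how such a link could be built from an existing TBPL URL follows. The base URL in the example (a tbpl.mozilla.org address with a `tree` parameter) and the helper name `show_all_android` are assumptions for illustration; only the `jobname=Android` and `noignore=1` parameters come from this bug.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def show_all_android(tbpl_url):
    """Return tbpl_url with jobname=Android and noignore=1 set in its query."""
    parts = urlsplit(tbpl_url)
    # Keep existing parameters, but drop any old jobname/noignore values.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("jobname", "noignore")
    ]
    query += [("jobname", "Android"), ("noignore", "1")]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Assumed example URL; substitute whichever tree view is being linked.
print(show_all_android("https://tbpl.mozilla.org/?tree=Mozilla-Inbound"))
```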

Comment 5

6 years ago
(In reply to Ed Morley [:edmorley] from comment #0)
> * improve the overall perception of the reliability of Android tests -
> starting the slow journey back to people trusting them.

I think this bug is a good change, but I did the math on the percentages in comment 1. Even with everything with a 30% failure rate or higher discarded, there's a 92% chance of a bogus failure in at least one of the remaining tests. So perhaps this could help perception somewhat, but I'd hold off on any announcements that everything is better to avoid a "never cry wolf" effect. (Never cry no wolf?)

It does mean that the odds of only needing a single retrigger on a visible failure are much higher.

I suppose automated retriggering would get things to a decent state, but that seems very bad from a load perspective.

Perhaps if a build of a later changeset is green, it could "auto-star" earlier builds? (Just in the tbpl UI, I mean.)
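
One way to arrive at a figure like the 92% above is sketched below, using the native suites from comment 1 whose failure rate is under 30%. It assumes that failures in different suites are independent and that every failure counted is bogus; neither assumption is stated in the comment, so treat this as a plausible reconstruction rather than the exact calculation used.

```python
from math import prod

# (failing, total) for the native suites from comment 1 that stay visible,
# i.e. those with a failure rate below 30%.
VISIBLE_NATIVE = {
    "mochitest-2": (17, 67), "mochitest-4": (7, 64), "mochitest-5": (7, 65),
    "mochitest-6": (9, 71), "mochitest-7": (12, 69),
    "crashtest-2": (7, 63), "crashtest-3": (3, 64),
    "jsreftest-1": (11, 72), "jsreftest-2": (10, 66), "jsreftest-3": (4, 61),
    "reftest-1": (9, 67), "remote-tdhtml": (3, 59),
    "remote-trobocheck2": (18, 64), "remote-troboprovider": (20, 71),
    "remote-tsvg": (8, 60), "remote-tp4m_nochrome": (5, 63),
}


def p_any_failure(stats):
    """Chance that at least one suite fails on a push, assuming independence."""
    p_all_green = prod(1 - failing / total for failing, total in stats.values())
    return 1 - p_all_green


p = p_any_failure(VISIBLE_NATIVE)
print(f"Chance of at least one failure per push: {p:.0%}")
```

Under the same independence assumption, the chance that every visible suite is green on a given push is only about one in twelve, which is why hiding the worst suites alone does not make pushes look green.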
(Assignee)

Comment 6

6 years ago
(In reply to Steve Fink [:sfink] (vacation Jul30-Aug10) from comment #5)
> I think this bug is a good change, but I did the math on the percentages in
> comment 1. Even with everything with a 30% failure rate or higher
> discarded, there's a 92% chance of a bogus failure in at least one of the
> remaining tests. So perhaps this could help perception somewhat, but I'd
> hold off on any announcements that everything is better to avoid a "never
> cry wolf" effect. (Never cry no wolf?)

Yeah I wasn't going to shout anything yet :-)
(This bug was as much for my sheriffing sanity as anything else - at least in the short term).

Post bug 775227's bad test disabling, the revised figures (inbound since ~Fri) for the hidden Native mochitests are:

Native:
* mochitest-1: 21 failing, 81 total (26%)
* mochitest-3: 76 failing, 78 total (97%)
* mochitest-8: 15 failing, 82 total (18%)
* robocop    : 39 failing, 81 total (48%)

I still haven't filed the dependent bugs; will do so after the a-team meeting. Will also try to track down where the m3 failure rate jumped up from that in comment 3, and un-hide m8 since it seems much better behaved now :-)
(Assignee)

Comment 7

6 years ago
Native Android M8 unhidden on all trees listed in comment 3.

Comment 8

6 years ago
looks like we won't have anything running before too long:)
(Assignee)

Updated

6 years ago
Depends on: 778952
(Assignee)

Updated

6 years ago
Depends on: 778954
(Assignee)

Updated

6 years ago
Depends on: 778956
(Assignee)

Updated

6 years ago
Depends on: 778958
(Assignee)

Updated

6 years ago
Depends on: 778960
(Assignee)

Updated

6 years ago
Depends on: 778961
(Assignee)

Updated

6 years ago
Depends on: 778962
(Assignee)

Updated

6 years ago
Depends on: 778963
(Assignee)

Updated

6 years ago
Depends on: 778964
(Assignee)

Updated

6 years ago
Depends on: 778965
(Assignee)

Updated

6 years ago
Depends on: 778967
(Assignee)

Comment 9

6 years ago
(In reply to Joel Maher (:jmaher) from comment #8)
> looks like we won't have anything running before too long:)

At least that might mean we get buy-in from platform... :-)

Joking aside, unless something drastic happens, we should only be unhiding them from this point forwards (eg M8 which has already improved enough to unhide).
(Assignee)

Comment 10

6 years ago
Created attachment 647531 [details] [diff] [review]
TBPL failure stats

This patch allows TBPL to show failure stats for the current view. Bit hacky but gets the job done.

To use:
1) Check out http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/
2) Apply patch
3) Run index.html from the local filesystem
4) Adjust filters to whichever Android suite(s) you would like included in the stats
5) Stats displayed top right of the UI, where the unstarred count normally is.

Since this method uses the server side components from prod, the data imports and hidden builders will all match prod TBPL :-)

It also changes the default TBPL refresh rate from 120 secs to 99999, to avoid the extreme janking you get when you are looking at the last several days of runs and it does a "Loading..." refresh.
Assignee: nobody → bmo
Status: NEW → ASSIGNED
(Assignee)

Updated

6 years ago
Depends on: 771626
(Assignee)

Updated

6 years ago
Depends on: 779871
(Assignee)

Comment 11

6 years ago
Created attachment 649047 [details]
Now that's what I'm talking about! :-D
(Assignee)

Comment 13

6 years ago
The last hidden Android {native,XUL} suite was unhidden in bug 778954 -> closing this out :-)
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard