Closed Bug 784681 Opened 12 years ago Closed 9 years ago

[Meta] Fix + unhide broken testsuites or else turn them off to save capacity

Categories

(Release Engineering :: General, defect)

Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Assigned: catlee)

References

(Depends on 1 open bug)

Details

(Keywords: meta, Whiteboard: [capacity])

Attachments

(2 files)

We currently have a few testsuites hidden on TBPL that are just wasting capacity (http://oduinn.com/blog/2012/08/21/137-hours-compute-hours-every-6-minutes/). We should either fix and unhide them, or else switch them off. There will be the odd exception/grey area (eg Spidermonkey builds, which are 'needed' but sometimes take weeks to get fixed) - but there are still many that need sorting out (eg OS X 10.7 tests).

I'm going to start by making a list of all hidden testsuites here, then file dependent bugs.
bah, submitted the form by accident.
Whiteboard: [capacity][ → [capacity]
Few points:
* Android XUL is going away in a week, so we can ignore those.
* Anything hidden only on beta will stay hidden for just the next week, so it's not worth looking at (philor or I will unhide on uplift).
* OS X 10.5 is being switched off in bug 773120, so not counting those either.

This leaves the following hidden jobs:

* jetpack tests on all platforms (mozilla-central, inbound, aurora):
  -> Are apparently looked at regularly (using &noignore=1). Would be good to confirm people are definitely checking them.

* spidermonkey builds on inbound:
  -> Same as above (and these are only run on changes to the js/src/ directory). Though the odd build is often busted for weeks without anyone noticing (eg I recently filed https://bugzil.la/778460,778469).

* peptest tests on all platforms (mozilla-central, inbound):
  -> Are faulty, since will stay green even when the browser crashes on startup for all other tests. Not sure what the plan is.

* OS X 10.7 on trunk trees + aurora + beta:
  debug (crashtest, mochitests-1, mochitests-3, mochitests-5, xpcshell)
  opt (xpcshell)
  -> need greening up, bug 700503 has been filed for this. We will ideally uplift fixes to aurora/beta.

* OS X 10.7 on esr10:
  debug (crashtest, mochitest 1+2+3+5+oth, reftest, xpcshell)
  opt (mochitest 1+2+5+oth, reftest, xpcshell)
  -> If the fixes from bug 700503 aren't test-only, then we might as well switch these off, since we won't be backporting.

* Linux valgrind (both x86 and x64) on mozilla-central:
  -> No idea what the story is for these.

* xulrunner builds on all platforms (mozilla-central, aurora):
  -> Think these are busted and hidden since not tier 1 (/not available on Try?).

* Android mozilla-central l10n nightly *
  -> No idea why/what/where/who.

* All Win64 tests on mozilla-central:
  -> Too many failures + product decided win64 not priority but didn't want to turn them off. Hiding has pretty much the same effect, so we should still either fix or stop running.

* Jetpack tree (note: different from the jetpack tests run on trunk trees). Not sure who is responsible for these/out of whose machine time this is run. The following are hidden:
  jetpack-mozilla-central-leopard-opt
  jetpack-mozilla-central-leopard-debug
  jetpack-mozilla-central-w764-debug
  jetpack-mozilla-aurora-w764-opt
  jetpack-mozilla-aurora-w764-debug
  jetpack-mozilla-beta-w764-opt
  jetpack-mozilla-beta-w764-debug
  jetpack-mozilla-release-w764-debug
  jetpack-mozilla-release-w764-opt
  -> seems like we should stop running these on win64 on all trees and turn off for 10.5 on m-c.
Depends on: 700503, 773120, 777037
jetpack: it's very easy to tell that jetpack is being checked regularly - just look at whether or not they are green, or were green sometime before if they are currently orange. We break it frequently, Kwierso sees that we broke it, uses the on-push runs to look up where, files bugs, cc's people, and gets them to fix their bustage. The system works, and hidden !== ignored.

xulrunner: it's not tier 1, it is tier 2, and the owner is quite happy with the status quo (and yelled at me for making the mistake of unhiding it a while back)

Win64: product said that they wanted it to be tier-1 by "early next year" even though we frequently forget that. It only has too many bug 692715 failures, not too many failures, but as long as visible on mozilla-central still means tier 1, we can't unhide it until there are enough slaves to run it on try and on integration branches.

jetpack tree: there is an abundance of bugs filed on the way it tries to run jobs against builds that don't exist; there just isn't an abundance of people who want to write the hacky patch to avoid them.
Depends on: 737661, 778969
Yeah, I have the Jetpack tree's tbpl page open pretty much all the time, and I check the Jetpack tests on the various mozilla-* trees at least once a day.

Each one gives somewhat different, but all useful, results. The Jetpack tests on the mozilla-* trees run against a static version of the SDK (updated manually by me every few weeks to the current Jetpack code), so I can see if changes to mozilla code break something in Jetpack.

The Jetpack tree runs the current Jetpack trunk code against the most recent successful Nightly build of Firefox, which can show if changes to Jetpack code break against Firefox. (It'll also show the brokenness in the other direction, but doesn't give me any hint as to what change actually broke it, since all of the changes from one Nightly to the next get grouped as one.)

In the happy and glorious future where the Jetpack team lands their APIs directly into mozilla-central, the Jetpack tree can probably go away, but it's incredibly useful at the moment.
> * Linux valgrind (both x86 and x64) on mozilla-central:
>   -> No idea what the story is for these.

This is blocked by bug 750856 (upgrade Valgrind on build machines).
linking to bug#772458 and its chain of bugs, so all different groups involved can track overlapping work
Blocks: 772458
(In reply to Wes Kocher (:KWierso) from comment #4)
> Yeah, I have the Jetpack tree's tbpl page open pretty much all the time, and
> I check the Jetpack tests on the various mozilla-* trees at least once a day.

Cool - thank you for the clarification :-)
Depends on: 785373
> * peptest tests on all platforms (mozilla-central, inbound):
>   -> Are faulty, since will stay green even when the browser crashes on
> startup for all other tests. Not sure what the plan is.

Filed bug 785373.

(In reply to Phil Ringnalda (:philor) from comment #3)
> Win64: product said that they wanted it to be tier-1 by "early next year"
> even though we frequently forget that. It only has too many bug 692715
> failures, not too many failures, but as long as visible on mozilla-central
> still means tier 1, we can't unhide it until there are enough slaves to run
> it on try and on integration branches.

Added dependency on bug 692715 and 784891, though I believe the latter is just for builders - I can't seem to find one for increasing the win64 testpool size / switching on for all trees.
Depends on: 692715, 784891
Depends on: 785798
Depends on: 718510
No longer depends on: 718510
Depends on: 786424
Catlee, spotted you added "If you know of other builds or tests that aren't used, or are perma-red/orange, let us know and we can disable them until they can be fixed!" to the Tuesday meeting wiki. This bug covers most instances - I don't believe there is much (/anything) left now. (I've just added a link to this bug from that wiki entry).
Depends on: 786084
Think we can stop running B2G GB builds now, just waiting in bug 771653 for confirmation :-)
Depends on: 780915
Depends on: 789357
Depends on: 790624
Depends on: 790630
Depends on: 792300
Depends on: 803530
Depends on: 795513
Depends on: 812076
No longer depends on: 812076
Depends on: 822813
Depends on: 823989
Depends on: 821728
No longer depends on: valgrind-on-tbpl
No longer depends on: 790624
No longer depends on: 790630
No longer depends on: 792300
No longer depends on: 692715
No longer depends on: 784891
Depends on: 814009
Depends on: 827540
Depends on: 842426
Depends on: 862657
joduinn says catlee has data on jobs we're running that are hidden. catlee: can you make that available?
Assignee: nobody → catlee
I put up a copy of an internal report I generated for the past week here:
http://people.mozilla.org/~catlee/reportor/2013-05-02:19/hidden/hidden.html
Depends on: 868878
That's a purty graph, but it shows the total number of apples-and-oranges-and-hammers-and-emus.

If a job is hidden and green, that means that it should be unhidden, or that it should be shut off, or that it should be left exactly like it is.

If a job is hidden and non-green, that means that it should be fixed and unhidden, or that it should be fixed and left hidden, or that it should be shut off, or (considerably more rarely) that it should be left exactly like it is.

There are no shortcuts here, much as people want there to be every couple of months. If someone wants to reaudit every hidden job, I'll be happy to show them where the "Adjust Hidden Builders" link is in the "Tree Info" menu on tbpl for each tree, and to tell them how to type "hidden" in the filter box to see just the hidden jobs, and to help them find out the why behind the things they can't figure out.
It's at least a starting point for discussion. I'd like to understand better what's running, how much time it's taking, and what value it has.

Can we categorize each hidden job into one of these buckets:
- it's failing, and needs to be fixed
- it doesn't run per-push (xulrunner, valgrind), can fail, and someone may one day fix it
- it can fail due to external factors and shouldn't be backed out (b2g jobs pulling in tip-of-external-repos)
- fixed and should be unhidden
- other?

Perhaps this is something treeherder should be tackling as well?

The first in particular should have a TTL on it. If nobody is working on fixing it up, it's a waste of time to run. We can run it less frequently or not at all.
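A rough sketch of what the buckets plus a TTL could look like, purely as illustration. Everything here is an assumption: the `HiddenJob` record, the `ttl_expired` helper, the 30-day default and the job names are hypothetical, and none of it reflects how TBPL or treeherder actually store visibility data.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Bucket(Enum):
    # The buckets proposed in the comment above.
    FAILING_NEEDS_FIX = "failing, and needs to be fixed"
    NOT_PER_PUSH = "doesn't run per-push, can fail, may one day be fixed"
    EXTERNAL_FACTORS = "can fail due to external factors"
    FIXED_UNHIDE = "fixed and should be unhidden"
    OTHER = "other"


@dataclass
class HiddenJob:
    # Hypothetical record; the field names are illustrative only.
    name: str
    bucket: Bucket
    hidden_since: date
    last_fix_activity: Optional[date] = None


def ttl_expired(job: HiddenJob, today: date,
                ttl: timedelta = timedelta(days=30)) -> bool:
    """Return True if a failing hidden job has gone longer than `ttl`
    without anyone working on it. Only the first bucket carries a TTL."""
    if job.bucket is not Bucket.FAILING_NEEDS_FIX:
        return False
    # Measure from the last fix activity, or from when the job was hidden
    # if nobody has touched it since.
    reference = job.last_fix_activity or job.hidden_since
    return today - reference > ttl
```

A job for which `ttl_expired` returns True would be a candidate for running less frequently or not at all; the other buckets still need a human decision.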
Depends on: 869290
The visibility changelog for all currently-hidden mozilla-central jobs. NB: Some of these hidden jobs may not be being run any more, but they'll still be listed in the TBPL builder visibility table, until the last run is removed by the ~monthly TBPL purge.

The amount of churn on visibility (and apparent confusion as people hide/unhide jobs with reasons such as 'why was this hidden') confirms the real need to have history surfaced in the UI (which thankfully bug 687143 is about to add). Hopefully that bug plus https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy will make things clearer moving forwards.
Same as previous, but for mozilla-inbound.
Personally, as the half of the "conversation" who knew it was a conversation, I always sort of liked those "hidden for cause" "why was this hidden?" "here's the reason again" "why?" "for the same reason again" ones.
(In reply to Chris AtLee [:catlee] from comment #15)
> Can we categorize each hidden job into one of these buckets:

Sure, it'll be fun! I'll explain what they are, and you can pick a bucket. Let's start with inbound:

* Seven SpiderMonkey builds, four (opt+debug, 32-bit and 64-bit) Linux warnings-as-errors shell builds that tend to be green and three (opt+debug warnings-as-errors, plus dtrace) Mac shell builds, of which the opt is currently green and the debug pair have been red for a long time, bug 862657. These were supposed to replace the builds on the TraceMonkey tree, which were supposed to keep JS more or less warnings-free, but not immediately - you would land your patch that caused some warnings, burn these builds, and then maybe later that week or the next week, you would fix them. There used to also be Windows builds, but in the absence of Brendan to yell at people about their warnings, nobody fixed the Windows ones while Waldo was off hiking, and I turned them off.

* Panda Fennec mochitest-webgl. This was the whole point of the recent operation to separate webgl out into its own mochitest hunk, because either nobody can agree on what to do about the failures and unexpected passes in the webgl tests on Pandas, or nobody can agree on who should do it, or something.

* Four b2g desktop builds. These let us tell who broke desktop b2g if someone does, but they can't be visible: when it's a gaia commit that breaks them, as it mostly is, we back out the first gecko push that happened to get the bad gaia commit. Nothing we've done yet to make the difference clear (which mostly consists of TinderboxPrinting the gaia commit on jobs other than the desktop ones, but not on the desktop ones) has helped the inbound sheriffs (who are quite literally and exactly the entire set of people who have Level 3 commit access) tell the difference. Some bug which is not the bug I filed about making that situation better intends to make it better.

* Two WinXP xpcshell tests. They fail intermittently a lot. The way they fail intermittently tells developers nothing about how and why they failed, so they are unable to fix the intermittent failures. People tried to make that better. It didn't get any better. They were apparently talked about in the weekly meeting for weeks. Now they apparently are no longer talked about. It amuses me to star them sometimes, so I would certainly notice within a day or two if someone actually broke them (for real, not for intermittent timeout) with a push, and having them run on-push, I would be able to find out who broke them in fairly short order.

That's the entirety of our profligate waste of resources on hidden jobs on inbound: four b2g builds which we cannot turn off because we cannot break them, and cannot show because we don't realize it wasn't us that broke them when they break; one Android test that I personally wouldn't mind seeing the back of, but that's not my decision; two WinXP tests that ought to be fixed, but nobody knows who or how; plus seven shell builds that only happen on pushes to js/src/.

Central? The same, plus all the ASan/Valgrind/DXR/Static analysis crud which is all broken, plus XULRunner which the owner wants hidden, plus those busted b2g VM jobs that Cedar wasn't good enough for, plus the Fennec l10n nightlies which for some reason report there unlike any other l10n nightlies, other than b2g desktop, where instead of just the four on-push we also have nightlies and localizer nightlies, plus Winx64.

All the other trunkish trees? Same as inbound, minus the SpiderMonkey jobs, except Ionmonkey which has them, but runs everything unhidden. Aurora? Panda mochitest-1 will get unhidden on the next merge, as will jetpack since that'll give us the code that ordinary humans can star. Beyond that, it has XULRunner hidden like it should be, and the Panda webgl that's an eternal flame. Beta? Jetpack unhides in six weeks, meantime KWierso has to keep watching it hidden; if releng can avoid turning anything broken on in the meantime (Android 4.0 on Panda, I'm looking at you!), it'll then have zero hidden jobs. Release is the same, only in 12 weeks.

Net, bottom-line, on-push: if you want to crap up mobile_config.py even more, and make Panda mochitest-webgl only run on try and not-by-default, that would amuse me, but I think we have massively more Pandas than we know what to do with; you can't have my WinXP xpcshell, not yet anyway, and apparently desktop b2g is going to stop beating its wife so you can't have it either.
If you want small wins of totally utterly useless things, though, Gaia-Master hasn't successfully built its one build since the morning of April 22nd, and the gaia-ui-test that runs on Cedar hasn't done anything except set RETRY half a dozen times until it can do what it calls "success" which is printing out a message telling you that you have to read a devmo page, create a local file saying you've read it, and include that local file in the commandline you use to run the tests, since sometime between April 26th and April 30th. Pretty sure those both go in the "other?" bucket, the one for things turned on prematurely which will be ignored until the things that someone actually wants are turned on instead, at which time those things may or may not be noticed and turned off.
Oh, gaia-ui-test actually runs hidden on-push on Birch, b2g18 and b2g18_1_0_1_0_1, spending 10 to 30 minutes per push doing absolutely nothing of any use or value, so maybe it is actually worth the trouble of turning off.
Depends on: 873725
Depends on: 874007
No longer depends on: 874007
Depends on: 875633
Depends on: 876084
Depends on: 877536
Depends on: 880273
Depends on: 873904
Depends on: 887642
Depends on: 903238
Depends on: 906383
Depends on: 916258
Depends on: 923570
Depends on: 923572
Depends on: 923880
Depends on: 923881
Depends on: 923882
Depends on: 924245
Depends on: 925279
Depends on: 932146
Depends on: 934257
Component: Tinderboxpushlog → General Automation
Product: Webtools → Release Engineering
QA Contact: catlee
Version: Trunk → other
Depends on: 929172
Depends on: 912502, 669384
Depends on: 960072
Depends on: 975683
Depends on: 975216
Depends on: 983269
Depends on: 980997
Depends on: 942111
Depends on: 1000123
Depends on: 1002800
Depends on: 1017607
It's been ages since we added any of the broken, hidden, and never going to be fixed dependencies to this bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: General Automation → General