Closed Bug 500562 Opened 15 years ago Closed 15 years ago

Try server talos numbers are statistically useless

Categories

(Release Engineering :: General, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: zwol, Unassigned)

Details

Currently, the try server runs talos against every patch submitted, on exactly two machines per patch per platform.  There appear to be four talos slaves per platform, and consecutively submitted patches are very likely to run on different machines.  There are 30 distinct talos results (six tests times five platforms).

If you are trying to determine whether your patch has a performance impact, you are therefore doing 30 separate comparisons of two unpaired samples, each sample containing two observations.  This has very little statistical power even if you don't correct for multiple comparisons, and with 30 comparisons you had better correct: if the comparisons are independent and the patch changes nothing, there is roughly a 79% chance that at least one of them hits the 5% significance level just by chance.
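The multiple-comparisons point can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 30 comparisons are independent and that the patch has no real effect, and the function names are made up for illustration.

```python
# Sketch only. Assumes the comparisons are independent and that the patch
# changes nothing, so every "significant" result is a false positive.

ALPHA = 0.05          # per-comparison significance level
N_COMPARISONS = 30    # six tests times five platforms


def familywise_error_rate(alpha, m):
    """Probability that at least one of m independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-comparison threshold that keeps the familywise rate at or below alpha."""
    return alpha / m


if __name__ == "__main__":
    uncorrected = familywise_error_rate(ALPHA, N_COMPARISONS)
    corrected = familywise_error_rate(
        bonferroni_alpha(ALPHA, N_COMPARISONS), N_COMPARISONS
    )
    print(f"chance of a spurious hit, uncorrected: {uncorrected:.3f}")
    print(f"per-comparison alpha after Bonferroni: {bonferroni_alpha(ALPHA, N_COMPARISONS):.5f}")
    print(f"chance of a spurious hit, corrected:   {corrected:.3f}")
```

Bonferroni is the simplest correction and the one implied by alpha=0.05/30; it is conservative when the talos tests are correlated with each other.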

I've run a basic power analysis in R: unpaired two-sample t-tests with the current setup can only detect a change greater than *five and a half standard deviations*, with alpha=0.05/30 (i.e. Bonferroni-corrected for the 30 comparisons) and beta=0.05 (i.e. 95% power).  In short, it's useless.

We need a lot more observations per platform, and they need to be paired.  I recommend the following:

Turn off talos for the current try server.  Not only is it a waste of time, space, and electricity, it may be giving people the false impression that its numbers are useful.

Have a separate "performance try" installation.  This should have no fewer than 10 (yes, TEN) slaves per platform.  Every time it updates from mozilla-central, it cycles through a complete talos run on all of the slaves and saves these baseline results.  (I see no need to update from mozilla-central more often than daily.)

When you push a patch to this installation, it also cycles through a complete talos run on every last one of the slaves.  When this is completely done, you get *one* email containing a spreadsheet that lists all the results for all of the slaves, and the matching baseline numbers for the m-c update that your patch was relative to.  This setup would be able to detect a change of 0.6 standard deviations or greater, which is much more like it.  (As long as it's generating a spreadsheet, it might as well fill in formulas to do the proper statistical analysis, too.)
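For illustration, the per-slave pairing could be analysed roughly like this. It is a sketch, not part of any existing Talos tooling: the function name `paired_t_statistic` and the sample numbers are hypothetical, and it computes only the test statistic, which would still have to be compared against a t distribution with n-1 degrees of freedom at the corrected alpha.

```python
import math
import statistics


def paired_t_statistic(baseline, patched):
    """Paired t statistic for per-slave results.

    baseline[i] and patched[i] must come from the same slave, so that
    machine-to-machine variation cancels out in the differences.
    Returns (t, degrees_of_freedom).
    """
    if len(baseline) != len(patched):
        raise ValueError("each slave needs both a baseline and a patched result")
    if len(baseline) < 2:
        raise ValueError("need at least two slaves")

    diffs = [p - b for b, p in zip(baseline, patched)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical numbers for one talos test on one platform, ten slaves.
    baseline = [412.0, 398.5, 405.2, 420.1, 401.7, 409.9, 415.3, 399.8, 407.4, 411.6]
    patched = [415.1, 400.2, 409.0, 421.5, 405.3, 411.0, 419.8, 401.1, 410.2, 414.9]
    t, df = paired_t_statistic(baseline, patched)
    print(f"t = {t:.2f} with {df} degrees of freedom")
```

Pairing matters because the slaves differ from one another: comparing each slave against its own baseline removes that difference from the noise term.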

If your patch fails to build on some platform, it should just abort the whole thing and send you the failing build log instead.  The intent is that you push to the regular try server first.
You've obviously put some thought into this, and I'm not trying to dissuade you from pursuing Talos improvements or even a Talos redesign, but this isn't the way to go about it.

If you have ideas about how to fix the inadequacies of Talos, how about talking to other devs who use Talos, the releng team that supports and maintains Talos, and the IT team that supports the underlying hardware for Talos?

I appreciate your concerns about statistical relevance, but proposing a brand new system out of thin air without first discussing the ramifications with the people that would need to design and support and use that new system seems shortsighted at best. 

What you've proposed above is a non-trivial amount of work. If you're serious about this proposal, building some developer momentum behind it might give it more weight. I don't think bugzilla should be the starting point for this discussion.
Status: UNCONFIRMED → RESOLVED
Closed: 15 years ago
Component: Release Engineering → Release Engineering: Future
Resolution: --- → INVALID
If not bugzilla, I honestly have no idea where the discussion should happen.  What would you suggest?
mozilla.dev.builds with a cc to dev.planning would be fine.
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering