Closed Bug 1109243 Opened 10 years ago Closed 9 years ago

Talos could benefit from the ability of turning a job orange on try server

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Assigned: jmaher)

References

Details

Attachments

(1 file, 3 obsolete files)

When a new feature is in development, or a fix for a known performance regression, talos is likely to be run on try server.  Wouldn't it be great if we could see orange on try server when we suspect a regression?

This is doable, and we can be smart about it to increase the certainty that a detected change is a real regression.  Of course there are many unique situations to think about, but it is something we can move forward on and maybe tighten the loop.
A few things to sort out:
* how do we deal with improvements?
* do we ensure all the tests in a suite are run?  right now we don't
* can we return a value to buildbot to retry the job?
* we need to make this try-only - possibly a flag via trychooser?
* I require 3+ data points and remove an outlier - is that valid?
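The "3+ data points, drop an outlier" idea from the list above could look roughly like the sketch below. This is illustrative only, not code from the actual patch; the function name and the median-distance heuristic are assumptions.

```python
# Hypothetical sketch of the "require 3+ data points, remove one
# outlier" rule; not taken from the talos source.

def filter_outlier(values):
    """Given 3+ replicate values, drop the single value farthest
    from the median and return the remaining values sorted."""
    if len(values) < 3:
        raise ValueError("need at least 3 data points")
    values = sorted(values)
    median = values[len(values) // 2]
    outlier = max(values, key=lambda v: abs(v - median))
    values.remove(outlier)
    return values
```

With fewer than three replicates there is no way to tell the outlier from the signal, which is why the rule insists on a minimum count before judging a run.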


This would be great to make a requirement for folks looking to land on beta/aurora.  The danger there: folks may not actually push to try for those branches, and we only run PGO on them, which could be frustrating.
Uploading my latest patch; this has been working great on try server.  Here are some features (example: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c794046975a4):
* if the geometric mean of values for a test is outside the 7-day range, auto-retry
* if we are re-triggering a specific job, automatically capture an SPS profile
* require 3 data points to post a sustained regression|improvement
* the job is retried automatically until we get to 3 data points (serially)
* the job goes orange as either test-unexpected-fail (regression) or test-unexpected-pass (improvement) to draw attention to the great work
* we default to mozilla-central as the source of truth
* this doesn't work with android, just desktop tests
* this is hacked to only work on branches that match Try*
* tested with all desktop tests/platforms, but not PGO or e10s
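The core decision the feature list describes (compare the geometric mean of the new replicates against the historical low/high range, then classify the result) can be sketched as follows. This is an assumption-laden illustration, not the patch itself; `classify` and its return values are hypothetical names.

```python
# Illustrative sketch of the check described above: flag the job when
# the geometric mean of the new values falls outside the historical
# low/high range from the last 7 days.
import math

def geomean(values):
    """Geometric mean of a list of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def classify(new_values, hist_low, hist_high, reverse=False):
    """Return 'regression', 'improvement', or None (in range).
    reverse=True for tests where higher numbers are better."""
    gm = geomean(new_values)
    if hist_low <= gm <= hist_high:
        return None
    worse = gm > hist_high  # higher is worse for most talos tests
    if reverse:
        worse = not worse
    return 'regression' if worse else 'improvement'
```

A run inside the historical range produces no status and the job stays green; only a sustained excursion (3 data points, per the list above) would turn it orange.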

That is a lot of stuff.  Please weigh in on concerns with this and other things which we should be aware of.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8538511 - Flags: review?(wlachance)
Attachment #8538511 - Flags: feedback?(avihpit)
Depends on: 1113146
Comment on attachment 8538511 [details] [diff] [review]
turn talos orange when numbers are out of the ordinary (1.0)

Review of attachment 8538511 [details] [diff] [review]:
-----------------------------------------------------------------

The patch adds lots of unrelated general updates and features, and applies decisions which deserve explanation (such as 3 data points instead of 12 for regression detection).

Also, it seemingly changes the concept of talos as a test-numbers generator and makes it reference historical data, etc.  This is a very big deviation from how the system works now.

- Please divide the patch into a few parts, including one for the general unrelated changes.

- Please describe how the talos changes interact with Treeherder.

- etc.
Attachment #8538511 - Flags: feedback?(avihpit) → feedback-
Depends on: 1113225
Depends on: 1113273
Comment on attachment 8538511 [details] [diff] [review]
turn talos orange when numbers are out of the ordinary (1.0)

going to split this into a few sub patches
Attachment #8538511 - Flags: review?(wlachance)
Depends on: 1113576
removing sps profiling, auto retries, etc.

This adds a little bit of code to run_tests.py as well as some support functions in compare.py.
Attachment #8533891 - Attachment is obsolete: true
Attachment #8538511 - Attachment is obsolete: true
Attachment #8539377 - Flags: review?(wlachance)
Comment on attachment 8539377 [details] [diff] [review]
turn a job orange on try if the data values are out of the norm (1.0)

Review of attachment 8539377 [details] [diff] [review]:
-----------------------------------------------------------------

r- because of my concern over android.

::: talos/compare.py
@@ +279,5 @@
>      return {'low': low, 'high': high, 'avg': average, 'geomean': geomean, 'count': count, 'data': vals}
>  
>  
> +def getCurrentPlatform(e10s):
> +    #TODO: when we add android support to compare.py, add it here

Will talos try jobs continue to work as expected on Android? From our discussion on IRC, it sounds like this is not the case...

@@ +314,5 @@
> +
> +    platform = getCurrentPlatform(e10s)
> +    return compareResults(revision, branch, master_branch, 0, [platform], [testname], pgo)
> +
> +def makeStatusMessage(results, testname, master_branch, min_count):

I think this method would go better in run_tests.py. It really has nothing to do with comparisons.
Attachment #8539377 - Flags: review?(wlachance) → review-
Updated: we now have support for Android, so that shouldn't be an issue anymore (yay winter-break contributors!)
Attachment #8539377 - Attachment is obsolete: true
Attachment #8543026 - Flags: review?(wlachance)
Comment on attachment 8543026 [details] [diff] [review]
turn a job orange on try if the data values are out of the norm (2.0)

Review of attachment 8543026 [details] [diff] [review]:
-----------------------------------------------------------------

Two things I'd like addressed before pushing (along with plenty of testing). Sorry for missing them the last time through.

::: talos/compare.py
@@ +311,5 @@
> +            platform = 'WinXP'
> +    if mozinfo.os == 'mac':
> +        if mozinfo.version == "OS X 10.6.8":
> +            platform = "OSX64"
> +        #TODO: verify this

Please actually verify this (and take the comment out) before pushing. :)

@@ +394,5 @@
>                      status = ':('
>                      if t in reverse_tests:
>                          status = ':)'
>  
> +                platVal['status'] = None

I think we should just define a global "REGRESSION" and "IMPROVEMENT" enum and use that (using a method like this: http://stackoverflow.com/a/702903). Trying to overload "True" and "False" to mean regression or improvement is even more confusing than doing comparisons against ':)' and ':('.
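A minimal sketch of the suggestion above, using module-level constants in the style of the linked Stack Overflow answer (the constant and function names are hypothetical, not from the patch):

```python
# Module-level status constants instead of overloading True/False.
REGRESSION = 'REGRESSION'
IMPROVEMENT = 'IMPROVEMENT'

def status_symbol(status, reverse_test=False):
    """Map a status constant to the ':(' / ':)' markers used in
    compare.py; reverse_test flips the meaning for tests where a
    numeric increase is good."""
    if status is None:
        return ''
    bad = (status == REGRESSION) != reverse_test
    return ':(' if bad else ':)'
```

Comparing against named constants (`status == REGRESSION`) reads unambiguously at every call site, which is the point of the review comment.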
Attachment #8543026 - Flags: review?(wlachance) → review+
Blocks: 1119444
Do we still want to do this? I would be inclined to drop this in favor of further improvements to the compare view in perfherder.
Flags: needinfo?(jmaher)
I think we should drop this for now.  We can revisit it later, but the concept of integrating compare view into our normal workflow should solve these issues :)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jmaher)
Resolution: --- → WONTFIX