We currently have some things in place to make 24-hour backouts a reality for perf regressions, but a lot of work is still left. Let's use this bug as a tracker.

:jmaher points out the stages in the life of a perf sheriff alert:

> -1: needs attention
> 0: new
>  - possibly do something different if this is a merge (look on other branches, etc. - for automation we don't need to)
> 1: backfilling: needs backfilling (could be the same as #1)
>  - mozci to verify rev +- 2 (rev-2, rev-1, rev, rev+1, rev+2) has data
>  - mozci to schedule 6 data points builds/jobs for the rev a +- 2 (might need a repeat if there are no builds)
>  - need to do this in 2 parts: 1, ensure we have builds; 2, ensure we have tests
>  - move to stage -1 if we cannot fill in the holes 100% (i.e. build bustage, dontbuild, trees closed, etc.)
> 2: has more data for specific test
>  - somehow verify we have a non-merge revision and that revision 'a' is where we shift (we could script this in perfherder/alertmanager)
> 3: needs all-talos run
>  - mozci: given revision A showing a regression, schedule all-talos (6 runs) for all tests/platforms for Rev A and A-1.
>  - mozci: might have to wait for builds
> 4: has all-talos data for revision a and a-1
>  - sanity check we have the full set of data
> 5: bug filed
> 6: closed (wontfix, backout, fixed)
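As a rough illustration only, the stage numbers in the list above could be given names so that tooling does not pass bare integers around. The class name `AlertStage`, the member names, and the helper `needs_human` are hypothetical and do not come from perfherder, alertmanager, or mozci; only the numeric values and their meanings are taken from the list.

```python
from enum import IntEnum


class AlertStage(IntEnum):
    """Hypothetical names for the stage numbers listed in this bug."""
    NEEDS_ATTENTION = -1     # automation gave up; a person has to look
    NEW = 0
    BACKFILLING = 1          # waiting on builds/tests for rev +- 2
    HAS_DATA = 2             # enough data for the specific test
    NEEDS_ALL_TALOS = 3      # all-talos runs to be scheduled for A and A-1
    HAS_ALL_TALOS_DATA = 4
    BUG_FILED = 5
    CLOSED = 6               # wontfix, backout, fixed


def needs_human(stage):
    """Assumption: only stage -1 is explicitly handed to a person."""
    return AlertStage(stage) is AlertStage.NEEDS_ATTENTION


print(AlertStage(3).name)  # NEEDS_ALL_TALOS
```

Because `IntEnum` members compare equal to plain integers, existing code that stores the stage as a number would keep working.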
A rough state machine suggested by :jmaher:

for alert in alerts:
    startRev = getPushLog(alert.rev) - 2
    endRev = getPushLog(alert.rev) + 2
    dataPoints = perfherder.query(alert.branch, alert.platform, alert.test, startRev, endRev)

    switch alert.stage:
        case 0:  # new
            if getRevision(alert.rev).merge:
                alert.stage = -1
                break
            if alert.branch.endswith('pgo'):
                alert.stage = -1
                break
            alert.stage = 1

        case 1:  # backfilling
            if len(dataPoints) < 5:
                status = mozci.trigger(alert.buildername, startRev, endRev, times=6)
                if len(status.builds) > 0:
                    alert.stage = 1  # we are waiting on builds, need to run this again
                else:
                    alert.stage = 2
                break
            alert.stage = 2

        case 2:  # enough data after initial backfilling, verify
            status = mozci.trigger(alert.buildername, startRev, endRev, times=6)
            if len(status.builds) > 0 or status.pending > 0 or status.running > 0:
                alert.stage = 1  # waiting on builds/tests
                break

            if len(dataPoints) < 5:
                alert.stage = -1  # all builds are done, missing jobs for revisions
                break

            for data in dataPoints:
                if len(data) < 6:
                    alert.stage = -1  # all builds are done, missing data for jobs
                    break

            # analyze the data, find specific revision:
            pl = getPushLog()
            badRevisions = []
            for rev in pl[startRev:endRev]:
                results = perfherder.compare(pl[rev], pl[rev-1], alert.branch, alert.platform, alert.test)
                if results.change < -2.0:
                    badRevisions.append(rev)

            if len(badRevisions) != 1:
                alert.stage = -1  # too noisy, other issues
                break

            if getRevision(badRevisions[0]).merge:
                alert.stage = -1
                break

            if alert.rev != badRevisions[0]:
                alert.rev = badRevisions[0]  # we misreported initially, possibly update other tools/status

            alert.stage = 3

        case 3:
            mozci.trigger_all_talos(alert.rev, alert.branch, times=6)
            previous_rev = getPushLog(alert.rev) - 1  # rev A-1, per the stage list
            mozci.trigger_all_talos(previous_rev, alert.branch, times=6)
            alert.stage = 4
            break

        case 4:
            # verify all data exists, i.e. jobs are completed
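The analysis step of stage 2 can be sketched in runnable Python on plain data, without any perfherder or mozci calls. This is only a sketch under stated assumptions: `percent_change`, `find_bad_pushes`, and `analyze` are hypothetical helpers, a simple comparison of means stands in for whatever `perfherder.compare` would actually do, and the sign convention (a change below -2.0 percent counts as a regression) is copied from the pseudocode rather than verified against Talos.

```python
from statistics import mean

REGRESSION_THRESHOLD = -2.0  # percent, same cutoff as the pseudocode
MIN_PUSHES = 5               # rev-2 .. rev+2
MIN_REPLICATES = 6           # data points wanted per push


def percent_change(before, after):
    """Percent change of the mean (assumes the mean of `before` is non-zero)."""
    base = mean(before)
    return (mean(after) - base) / base * 100.0


def find_bad_pushes(pushes):
    """pushes: list of (revision, [values]) ordered oldest first.

    Returns the revisions whose change relative to the previous push
    falls below REGRESSION_THRESHOLD.
    """
    bad = []
    for (_, prev_vals), (rev, vals) in zip(pushes, pushes[1:]):
        if percent_change(prev_vals, vals) < REGRESSION_THRESHOLD:
            bad.append(rev)
    return bad


def analyze(pushes):
    """Return (next_stage, revision): stage 3 with the culprit, else stage -1."""
    if len(pushes) < MIN_PUSHES:
        return -1, None  # missing jobs for revisions
    if any(len(vals) < MIN_REPLICATES for _, vals in pushes):
        return -1, None  # missing data for jobs
    bad = find_bad_pushes(pushes)
    if len(bad) != 1:
        return -1, None  # too noisy, or no clear shift
    return 3, bad[0]


pushes = [
    ("rev-2", [100.0] * 6),
    ("rev-1", [100.0] * 6),
    ("rev", [95.0] * 6),
    ("rev+1", [95.0] * 6),
    ("rev+2", [95.0] * 6),
]
print(analyze(pushes))  # (3, 'rev')
```

The merge check and the build/test status check from the pseudocode are left out here because they depend on services this sketch does not model.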
We had a meeting, and these are some things to take action on: https://etherpad.mozilla.org/perf-backouts
closing out old bugs that haven't been a priority
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → WONTFIX