[Meta] Tracking bug to make 24-hour backouts a reality

RESOLVED WONTFIX

Status

RESOLVED WONTFIX
3 years ago
6 months ago

People

(Reporter: vaibhav1994, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
We currently have some things in place to make 24-hour backouts a reality for perf regressions, but a lot of work is still left. Let's use this bug as a tracker.

:jmaher points out the stages in the life of an alert handled by a perf sheriff:

>  -1: needs attention
>   0: new
>   - possibly do something different if this is a merge (look on other branches, etc. - for automation we don't need to)
>   1: backfilling: needs backfilling (could be the same as #1)
>   - mozci to verify rev +- 2 (rev-2, rev-1, rev, rev+1, rev+2) has data
>   - mozci to schedule 6 data points (builds/jobs) for rev A +- 2 (might need a repeat if there are no builds)
>   - need to do this in 2 parts: 1, ensure we have builds; 2, ensure we have tests
>   - move to stage -1 if we cannot fill in the holes 100% (i.e. build bustage, dontbuild, trees closed, etc.)
>  2: has more data for specific test
>   - somehow verify we have a non merge revision and that revision 'a' is where we shift (we could script this in perfherder/alertmanager)
>  3: needs all-talos run
>   - mozci: given revision A showing a regression, schedule all-talos (6 runs) for all tests/platforms for Rev A and A-1.
>   - mozci: might have to wait for builds
>  4: has all-talos data for revision a and a-1
>   - sanity check we have the full set of data
>  5: bug filed
>  6: closed (wontfix, backout, fixed)
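
The stages and the "rev +- 2" backfill window above could be sketched roughly as follows. This is only an illustration: the enum member names, `backfill_window`, and `missing_revisions` are hypothetical helpers, the pushlog is modelled as a plain ordered list of revision ids, and nothing here is actual mozci or Perfherder API.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Alert stages from the list above; member names are illustrative."""
    NEEDS_ATTENTION = -1
    NEW = 0
    BACKFILLING = 1
    HAS_DATA = 2
    NEEDS_ALL_TALOS = 3
    HAS_ALL_TALOS = 4
    BUG_FILED = 5
    CLOSED = 6


def backfill_window(pushlog, rev, radius=2):
    """Return revisions rev-radius .. rev+radius, clipped to the pushlog.

    Assumes `pushlog` is an ordered list of revision ids (oldest first).
    """
    i = pushlog.index(rev)
    return pushlog[max(0, i - radius): i + radius + 1]


def missing_revisions(window, data_by_rev, wanted=6):
    """Revisions in the window that have fewer than `wanted` data points."""
    return [r for r in window if len(data_by_rev.get(r, [])) < wanted]
```

Anything returned by `missing_revisions` would need retriggering; if it cannot be filled in (build bustage, dontbuild, closed trees), the alert would move to `Stage.NEEDS_ATTENTION`.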
(Reporter)

Comment 1

3 years ago
A rough state machine suggested by :jmaher


for alert in alerts:
    startRev = getPushLog(alert.rev) - 2
    endRev = getPushLog(alert.rev) + 2
    dataPoints = perfherder.query(alert.branch, alert.platform, alert.test, startRev, endRev)
    switch alert.stage:
        case 0: #new
            if getRevision(alert.rev).merge:
                alert.stage = -1
                break
            if alert.branch.endswith('pgo'):
                alert.stage = -1
                break
            alert.stage = 1
        case 1: #backfilling
            if len(dataPoints) < 5:
                status = mozci.trigger(alert.buildername, startRev, endRev, times=6)
                if len(status.builds) > 0:
                    alert.stage = 1 # we are waiting on builds, need to run this again
                else:
                    alert.stage = 2
                break
            alert.stage = 2
        case 2: # enough data after initial backfilling, verify
            status = mozci.trigger(alert.buildername, startRev, endRev, times=6)
            if len(status.builds) > 0 or status.pending > 0 or status.running > 0:
                alert.stage = 1 # waiting on builds/tests
                break
               
            if len(dataPoints) < 5:
                alert.stage = -1  # all builds are done, missing jobs for revisions
                break
           
            if any(len(data) < 6 for data in dataPoints):
                alert.stage = -1 # all builds are done, missing data for jobs
                break

            # analyze the data, find specific revision:
            pl = getPushLog()
            badRevisions = []
            for rev in range(startRev, endRev + 1):
                results = perfherder.compare(pl[rev], pl[rev-1], alert.branch, alert.platform, alert.test)
                if results.change < -2.0:
                    badRevisions.append(pl[rev])

            if len(badRevisions) != 1:
                alert.stage = -1 # too noisy, other issues
                break

            if getRevision(badRevisions[0]).merge:
                alert.stage = -1
                break

            if alert.rev != badRevisions[0]:
                alert.rev = badRevisions[0] # we misreported initially, possibly update other tools/status
            alert.stage = 3
        case 3:
            mozci.trigger_all_talos(alert.rev, alert.branch, times=6)
            previous_rev = getPushLog(alert.rev) - 1 # revision A-1, per stage 3 in the description
            mozci.trigger_all_talos(previous_rev, alert.branch, times=6)
            alert.stage = 4
            break
        case 4:
             # verify all data exists, i.e. jobs are completed
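
For reference, the stage 2 verification step could be written as runnable Python along these lines. This is a sketch under assumptions, not the intended implementation: `percent_change`, `find_bad_revisions`, and `verify` are hypothetical names, the data is passed in as a dict of revision id to a list of measurements instead of coming from Perfherder, and a drop of more than 2% in the mean is treated as a regression to mirror the `< -2.0` check (the real comparison semantics may differ). Unlike the pseudocode, the first revision in the window is not compared against a revision outside it.

```python
from statistics import mean

REGRESSION_THRESHOLD = -2.0  # percent change, value taken from the pseudocode


def percent_change(before, after):
    """Percent change of the mean going from `before` to `after`."""
    base = mean(before)
    return (mean(after) - base) / base * 100.0


def find_bad_revisions(window, data_by_rev, threshold=REGRESSION_THRESHOLD):
    """Compare each revision in the window with its predecessor in the window."""
    bad = []
    for prev, rev in zip(window, window[1:]):
        if percent_change(data_by_rev[prev], data_by_rev[rev]) < threshold:
            bad.append(rev)
    return bad


def verify(alert_rev, window, data_by_rev, is_merge, wanted=6):
    """Stage 2 check; returns (next_stage, revision).

    -1 means "needs attention", 3 means "needs all-talos run".
    `is_merge` is a caller-supplied predicate (an assumption of this sketch).
    """
    # Missing revisions or missing data points: a human needs to look.
    if len(window) < 5:
        return -1, alert_rev
    if any(len(data_by_rev.get(r, [])) < wanted for r in window):
        return -1, alert_rev

    bad = find_bad_revisions(window, data_by_rev)
    # Zero or several candidates means the data is too noisy to automate.
    if len(bad) != 1 or is_merge(bad[0]):
        return -1, alert_rev

    # Exactly one non-merge culprit; it may differ from the reported one.
    return 3, bad[0]
```

Returning the stage and the (possibly corrected) revision together keeps the function free of side effects, so it can be exercised with canned data before being wired to real alerts.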
(Reporter)

Updated

3 years ago
Depends on: 1180742
(Reporter)

Updated

3 years ago
Depends on: 1178222
(Reporter)

Comment 2

3 years ago
We had a meeting, and these are some things to take action on: https://etherpad.mozilla.org/perf-backouts
(Reporter)

Updated

3 years ago
Depends on: 1186185
(Reporter)

Updated

3 years ago
Depends on: 1186191
(Reporter)

Updated

3 years ago
Depends on: 1186196
(Reporter)

Updated

3 years ago
Depends on: 1186201
Closing out old bugs that haven't been a priority.
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → WONTFIX