Open Bug 1562178 Opened 5 years ago Updated 2 years ago

[meta] Automate backfills and retriggers for alerts

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: igoldan, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: meta)

Attachments

(1 file)

Code & Perf Sheriff Interaction.pdf, attached 5 years ago by Ionuț Goldan [:igoldan] (89.50 KB, application/pdf)

Ionuț Goldan [:igoldan]

Reporter

Description

5 years ago

Attached file: Code & Perf Sheriff Interaction.pdf

I think there are multiple ways to tackle this. I propose the following approach, since it is close to how Code sheriffs have supported, and still support, us Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and retrigger/backfill them. It should have a daily retriggering & backfilling limit, specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on the next cron run.
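
To make the proposal concrete, here is a minimal sketch of one cron run with a daily limit and skipping of already-handled summaries. Everything named here is an assumption for illustration: `MAX_RETRIGGERS_PER_DAY` stands in for whatever ends up in settings.py, `DailyBudget` and `run_cron` are hypothetical, and the cost of 5 per summary is a placeholder. It is not how Treeherder currently does this.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical setting; the description only says the limit lives in settings.py.
MAX_RETRIGGERS_PER_DAY = 600


@dataclass
class DailyBudget:
    """Tracks how many retriggers/backfills were spent on the current day."""
    limit: int = MAX_RETRIGGERS_PER_DAY
    day: date = field(default_factory=date.today)
    used: int = 0

    def try_spend(self, amount: int, today: date) -> bool:
        # Start a fresh budget when the day changes.
        if today != self.day:
            self.day, self.used = today, 0
        if self.used + amount > self.limit:
            return False
        self.used += amount
        return True


def run_cron(summary_ids, handled_ids, budget, today, cost_per_summary=5):
    """Handle new summaries until the daily budget runs out.

    Returns the ids handled in this run and records them in handled_ids,
    so the next run skips them.
    """
    processed = []
    for summary_id in summary_ids:
        if summary_id in handled_ids:
            continue  # already retriggered/backfilled by an earlier run
        if not budget.try_spend(cost_per_summary, today):
            break  # daily limit reached; leave the rest for tomorrow
        handled_ids.add(summary_id)
        processed.append(summary_id)
    return processed
```

In a real implementation the handled set and the budget counter would need to be persisted (for example in the database) rather than kept in memory between runs.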

The tricky part: an alert summary could very likely contain many, many alerts. Retriggering/backfilling all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5, let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the Retrigger/backfill [2] section, with some additional notes.

The cronjob will target the Raptor, Talos & AWSY test frameworks.
It should do a max of 600 retriggers per platform per day (that's the total limit a full Code sheriff shift had).
It should have a special algorithm for picking which alerts to retrigger.
This algorithm should consider these priorities, in this precise order:

regression > improvement
Windows 10 > Windows 7 > Linux > OSX > Android
percentage magnitude (how serious a particular alert is)
ideally, each selected alert should originate from a different platform
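
The priorities above can be sketched as a sort followed by a platform-diversity pass. This is one possible reading, not a settled design: the `Alert` fields, the platform keys in `PLATFORM_RANK`, and `pick_alerts` are all hypothetical names, and the diversity rule is treated as the weakest preference, applied only within regressions and then within improvements.

```python
from dataclasses import dataclass

# Lower rank means higher priority, following the order in the description.
PLATFORM_RANK = {"windows10": 0, "windows7": 1, "linux": 2, "osx": 3, "android": 4}


@dataclass(frozen=True)
class Alert:
    id: int
    is_regression: bool
    platform: str
    pct_change: float


def sort_key(alert):
    # Platform priority first, then larger magnitude first.
    # Unknown platforms sort after all known ones.
    rank = PLATFORM_RANK.get(alert.platform, len(PLATFORM_RANK))
    return (rank, -abs(alert.pct_change))


def pick_alerts(alerts, max_alerts=5):
    picked = []
    # Regressions always come before improvements.
    for want_regression in (True, False):
        group = sorted(
            (a for a in alerts if a.is_regression == want_regression),
            key=sort_key,
        )
        seen = {a.platform for a in picked}
        leftovers = []
        for alert in group:
            if alert.platform in seen:
                leftovers.append(alert)  # same platform as an earlier pick
            else:
                picked.append(alert)
                seen.add(alert.platform)
        # Repeated platforms are used only after the distinct ones.
        picked.extend(leftovers)
    return picked[:max_alerts]
```

With this reading, a second Windows 10 regression is still preferred over any improvement, but it is picked after the best regression from each other platform.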

Joel Maher ( :jmaher ) (UTC -8)

Comment 1

5 years ago

A few questions:

If this is a cron job, how do we know the work wasn't already done manually?
Will any manual bisection be expected?
What if jobs were already backfilled by a previous cronjob run? Will we double backfill and generate 2x as much data? How can we prevent that?
What about inbound vs autoland? What about beta alerts?

I think we can limit this to:

regressions only (not improvements)
all platforms, but maybe limit android to 100/day
5%+ regressions only
pick up to 3 jobs to backfill from a given summary
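
As a rough illustration of these limits, the filter could look like the sketch below. The names (`Alert`, `select_for_backfill`) are hypothetical. Two details are assumptions the list does not settle: alerts are tried in order of magnitude, and the Android cap is counted as one unit per selected alert.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    id: int
    is_regression: bool
    platform: str
    pct_change: float


def select_for_backfill(
    alerts,
    android_used_today=0,
    *,
    min_pct=5.0,
    max_per_summary=3,
    android_daily_limit=100,
):
    """Return (selected alerts, updated Android count) for one summary."""
    selected = []
    # Assumption: try the largest changes first.
    for alert in sorted(alerts, key=lambda a: -abs(a.pct_change)):
        if len(selected) == max_per_summary:
            break
        # Regressions only, and only those of 5% or more.
        if not alert.is_regression or abs(alert.pct_change) < min_pct:
            continue
        if alert.platform.startswith("android"):
            if android_used_today >= android_daily_limit:
                continue  # Android cap reached for today
            android_used_today += 1
        selected.append(alert)
    return selected, android_used_today
```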

I think it wouldn't be hard to execute a Taskcluster backfill action for a given job/revision from a cronjob or from inside of Treeherder when we generate the alerts. I like the cronjob better, as it gives other alerts a chance to come in and helps focus on specific alerts instead of picking up every random alert. I'd suggest an hourly cron that looks at alerts generated at least an hour ago that haven't had a backfill issued, ideally marking the alerts table to indicate a backfill was issued.
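
A minimal sketch of that hourly pass, assuming a hypothetical `backfill_issued` flag on each alert row and a `trigger_backfill` callable that stands in for the Taskcluster backfill action (neither is an existing Treeherder name):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertRow:
    id: int
    created: datetime
    backfill_issued: bool = False  # hypothetical column on the alerts table


def due_for_backfill(rows, now, min_age=timedelta(hours=1)):
    """Alerts that are at least min_age old and have no backfill issued yet."""
    return [
        row for row in rows
        if not row.backfill_issued and now - row.created >= min_age
    ]


def hourly_run(rows, now, trigger_backfill):
    """Issue a backfill for each due alert and mark it, so later runs skip it."""
    issued = []
    for row in due_for_backfill(rows, now):
        trigger_backfill(row)
        # Marked only after the trigger call returns without raising.
        row.backfill_issued = True
        issued.append(row.id)
    return issued
```

Marking the row is also what would prevent a later cron run from backfilling the same alert twice.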

A few exceptions I can think of which will cause missing data and require manual backfilling/work:

backfill job fails to execute (timed out, infra, bad params)
builds/jobs fail to execute, net result no data

Before doing this, it would be useful to have metrics on the number of jobs executed as backfills. Then we could track the cost of backfilling and the efficiency of a script doing it vs a human.