[meta] Automate backfills and retriggers for alerts
Categories
(Tree Management :: Perfherder, task, P3)
Tracking
(Not tracked)
People
(Reporter: igoldan, Unassigned)
References
(Blocks 2 open bugs)
Details
(Keywords: meta)
Attachments
(1 file)
89.50 KB,
application/pdf
I think there are multiple approaches to tackle this. I, for one, propose the following, as it is close to how Code sheriffs supported, and still support, us Perf sheriffs.
Define and configure a cronjob that runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering and backfilling limit, established per day, say, and specified somewhere in the settings.py module.
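As a rough sketch, the limits could live in settings.py like this (all names below are hypothetical, not existing Treeherder settings):

```python
# Hypothetical settings.py fragment -- the setting names are
# illustrative stand-ins, not actual Treeherder settings.

# Hard cap on automated retriggers/backfills, per platform, per day
# (roughly what a full Code sheriff shift used to do in total).
MAX_BACKFILLS_PER_PLATFORM_PER_DAY = 600

# How many alerts to pick from a single alert summary.
MAX_ALERTS_PER_SUMMARY = 5

# Test frameworks the cronjob targets.
AUTO_BACKFILL_FRAMEWORKS = ["raptor", "talos", "awsy"]
```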
Summaries that have been handled this way will be skipped on the next cron run.
The tricky part: alert summaries could very likely contain many, many alerts. Retriggering/backfilling all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than that.
It should know how to pick the most relevant alerts (say, a max of 5) and then retrigger them. More details on how to do that are provided in the attached document, under the Retrigger/backfill [2] section, along with some extra notes.
The cronjob will target Raptor, Talos & AWSY test frameworks.
It should do a max of 600 retriggers per platform per day (that's the limit a full Code sheriff shift had in total).
It should have a special algorithm for picking which alerts to choose from.
This algorithm should consider these priorities, in this precise order:
- regression > improvement
- Windows 10 > Windows 7 > Linux > OSX > Android
- percentage magnitude (how serious a particular alert is)
- ideally, each selected alert should originate from a different platform
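The priorities above can be sketched as a sort key plus a platform-diversity pass. This is only an illustration of the proposed selection order; the `Alert` shape and platform names are assumptions, not Treeherder's actual data model:

```python
from dataclasses import dataclass

# Sketch of the alert-picking priorities described above. The Alert
# structure and platform identifiers are invented for illustration.

PLATFORM_RANK = {"windows10": 0, "windows7": 1, "linux": 2, "osx": 3, "android": 4}

@dataclass
class Alert:
    platform: str
    is_regression: bool
    magnitude: float  # percentage change of the alert

def pick_alerts(alerts, limit=5):
    """Pick up to `limit` alerts: regressions first, then by platform
    priority, then by percentage magnitude; prefer one per platform."""
    ranked = sorted(
        alerts,
        key=lambda a: (
            not a.is_regression,                # regression > improvement
            PLATFORM_RANK.get(a.platform, 99),  # Win10 > Win7 > Linux > OSX > Android
            -a.magnitude,                       # bigger changes first
        ),
    )
    picked, seen_platforms = [], set()
    # First pass: at most one alert per platform.
    for a in ranked:
        if len(picked) == limit:
            break
        if a.platform not in seen_platforms:
            picked.append(a)
            seen_platforms.add(a.platform)
    # Second pass: fill any remaining slots regardless of platform.
    for a in ranked:
        if len(picked) == limit:
            break
        if a not in picked:
            picked.append(a)
    return picked
```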
Comment 1•5 years ago
A few questions:
- if this is a cron job, how do we know whether this was already done manually?
- will there be any manual bisection expected?
- what if jobs were already backfilled by a previous cron run? Will we double-backfill and generate 2x as much data, and how can we prevent that?
- what about inbound vs. autoland? What about beta alerts?
I think we can limit this to:
- regressions only (not improvements)
- all platforms, but maybe limit android to 100/day
- 5%+ regressions only
- pick up to 3 jobs to backfill from a given summary
I think it wouldn't be hard to execute a taskcluster backfill action for a given job/revision from a cronjob, or from inside of treeherder when we generate the alerts. I like the cronjob better, as it gives other alerts a chance to come in and helps focus on specific alerts instead of picking up every random alert. An hourly cron could look at alerts generated at least an hour ago that haven't issued a backfill, ideally marking the alerts table to indicate a backfill was issued.
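The hourly-cron idea with the limits suggested above could look roughly like this. All names here (the summary dict shape, `trigger_backfill`, the `backfilled` flag) are hypothetical stand-ins, not actual Treeherder or Taskcluster APIs:

```python
from datetime import datetime, timedelta

# Sketch of the hourly backfill cron described above. The data shapes
# and the trigger_backfill callable are assumptions for illustration.

MAX_JOBS_PER_SUMMARY = 3      # "pick up to 3 jobs to backfill from a given summary"
MIN_REGRESSION_PCT = 5.0      # "5%+ regressions only"
MIN_AGE = timedelta(hours=1)  # let related alerts accumulate before acting

def run_backfill_cron(summaries, trigger_backfill, now=None):
    """Backfill eligible alerts, marking each summary so the next run skips it."""
    now = now or datetime.utcnow()
    issued = 0
    for summary in summaries:
        if summary["backfilled"]:               # already handled by a previous run
            continue
        if now - summary["created"] < MIN_AGE:  # too fresh, wait a cycle
            continue
        eligible = [
            a for a in summary["alerts"]
            if a["is_regression"] and a["magnitude"] >= MIN_REGRESSION_PCT
        ]
        for alert in eligible[:MAX_JOBS_PER_SUMMARY]:
            trigger_backfill(alert)             # e.g. a taskcluster backfill action
            issued += 1
        summary["backfilled"] = True            # record in the alerts table
    return issued
```

Marking `backfilled` before any jobs finish is what prevents a later cron run from double-backfilling the same summary.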
A few exceptions I can think of which will cause missing data and require manual backfilling/work:
- backfill job fails to execute (timed out, infra, bad params)
- builds/jobs fail to execute, net result no data
Before doing this, it would be useful to have metrics on the number of jobs executed as backfills; then we could track the cost of backfilling and the efficiency of a script doing it vs. a human.
Comment 2•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #1)
> I think we can limit this to:
> - regressions only (not improvements)
> - all platforms, but maybe limit android to 100/day
> - 5%+ regressions only
> - pick up to 3 jobs to backfill from a given summary
> I think it wouldn't be hard to execute a taskcluster backfill action for a given job/revision from a cronjob or from inside of treeherder when we generate the alerts. I like the cronjob better as it gives a chance to allow for other alerts to come in and help focus on specific alerts instead of picking up every random alert. An hourly cron that looks at alerts generated at least an hour ago that haven't issued a backfill. Ideally marking the alerts table to indicate a backfill was issued.
These are all excellent suggestions, thanks Joel!
> Before doing this, having metrics of number of jobs executed as backfill would be useful to know- then we could track the cost of backfilling and efficiency of a script doing it vs a human.
Agreed, we've already started to look into this. I'm going to move these tasks out of Jira and into Bugzilla for better visibility and collaboration.
Updated•5 years ago