Open Bug 1562178 Opened 1 year ago Updated 5 days ago

[meta] Automate backfills and retriggers for alerts

Categories

(Tree Management :: Perfherder, task, P3)

Tracking

(Not tracked)

People

(Reporter: igoldan, Unassigned)

References

(Depends on 4 open bugs, Blocks 1 open bug)

Details

(Keywords: meta)

Attachments

(1 file)

I think there are multiple ways to tackle this. I propose the following approach, as it is close to how Code sheriffs have supported, and still support, us Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering & backfilling limit (per day, let's say), specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on the next cron run.
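To make the daily limit and the skip-if-handled rule concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the names `RETRIGGER_DAILY_LIMIT`, `DailyBudget` and `run_cron`, the placeholder limit value, and the fixed cost per summary are not existing Treeherder code or settings.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical setting name and placeholder value; the proposal only says the
# limit should live "somewhere in the settings.py module".
RETRIGGER_DAILY_LIMIT = 600


@dataclass
class DailyBudget:
    """Tracks how many retriggers/backfills were spent on the current day."""

    limit: int = RETRIGGER_DAILY_LIMIT
    day: date = field(default_factory=date.today)
    spent: int = 0

    def try_spend(self, amount: int, today: date) -> bool:
        # Start from zero again when a new day begins.
        if today != self.day:
            self.day, self.spent = today, 0
        if self.spent + amount > self.limit:
            return False
        self.spent += amount
        return True


def run_cron(summary_ids, handled_ids, budget, today, cost_per_summary=5):
    """Handle each new summary once; stop as soon as the budget is exhausted."""
    processed = []
    for summary_id in summary_ids:
        if summary_id in handled_ids:
            continue  # already handled by an earlier cron run
        if not budget.try_spend(cost_per_summary, today):
            break  # daily limit reached; remaining summaries wait for a later run
        handled_ids.add(summary_id)
        processed.append(summary_id)
    return processed
```

In a real cronjob the set of handled summaries and the spent counter would need to be persisted between runs (for example in the database); the in-memory objects here only show the control flow.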

The tricky part: alert summaries could very likely contain many, many alerts. Retriggering/backfilling all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5, let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the Retrigger/backfill [2] section, with some additional notes.

The cronjob will target the Raptor, Talos & AWSY test frameworks.
It should do a max of 600 retriggers per platform per day (that's the limit a full Code sheriff shift had in total).
It should have a special algorithm for picking which alerts to retrigger.
This algorithm should consider these priorities, in this precise order:

  • regression > improvement
  • Windows 10 > Windows 7 > Linux > OSX > Android
  • percentage magnitude (how serious a particular alert is)
  • ideally, each selected alert should originate from a different platform
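The priority list above could be sketched roughly as follows. This is only one possible reading, not an agreed design: the `Alert` fields, the platform keys and `pick_alerts` are hypothetical names, and the "different platform" preference is implemented as a first pass, which means a lower-ranked alert from an unseen platform can be picked ahead of a second alert from an already-used platform.

```python
from dataclasses import dataclass

# Platform order taken from the list above; a lower rank means higher priority.
PLATFORM_RANK = {"windows10": 0, "windows7": 1, "linux": 2, "osx": 3, "android": 4}


@dataclass(frozen=True)
class Alert:
    id: int
    platform: str
    is_regression: bool
    pct_change: float  # magnitude of the change, in percent


def pick_alerts(alerts, max_alerts=5):
    """Return up to max_alerts alerts, ordered by the priorities listed above."""
    ranked = sorted(
        alerts,
        key=lambda a: (
            not a.is_regression,  # regressions sort before improvements
            PLATFORM_RANK.get(a.platform, len(PLATFORM_RANK)),  # unknown platforms last
            -abs(a.pct_change),  # larger magnitude first
        ),
    )
    picked, seen_platforms = [], set()
    # First pass: take the best-ranked alert of each platform.
    for alert in ranked:
        if len(picked) == max_alerts:
            break
        if alert.platform not in seen_platforms:
            picked.append(alert)
            seen_platforms.add(alert.platform)
    # Second pass: fill any remaining slots in plain rank order.
    for alert in ranked:
        if len(picked) == max_alerts:
            break
        if alert not in picked:
            picked.append(alert)
    return picked
```

If platform diversity should never override the first three priorities, the first pass can simply be dropped and the function reduces to `ranked[:max_alerts]`.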

A few questions:

  • If this is a cron job, how do we know whether this was already done manually?
  • Will any manual bisection be expected?
  • What if jobs were already backfilled by a previous cronjob? Will we double backfill and generate 2x as much data? How can we prevent that?
  • What about inbound vs autoland? What about beta alerts?

I think we can limit this to:

  • regressions only (not improvements)
  • all platforms, but maybe limit android to 100/day
  • 5%+ regressions only
  • pick up to 3 jobs to backfill from a given summary
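As a rough illustration of these limits, the filter could look like the sketch below. The dictionary keys (`is_regression`, `pct_change`) and the function name are made up for the example, and ranking the candidates by magnitude is an assumption, since the list does not say which 3 jobs to prefer. The per-day Android cap is not covered here.

```python
def select_for_backfill(alerts, min_pct=5.0, max_jobs=3):
    """Keep regressions of at least min_pct percent and return the largest max_jobs."""
    candidates = [
        alert
        for alert in alerts
        if alert["is_regression"] and abs(alert["pct_change"]) >= min_pct
    ]
    # Assumption: when more than max_jobs qualify, prefer the largest regressions.
    candidates.sort(key=lambda alert: abs(alert["pct_change"]), reverse=True)
    return candidates[:max_jobs]
```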

I think it wouldn't be hard to execute a Taskcluster backfill action for a given job/revision from a cronjob or from inside of Treeherder when we generate the alerts. I like the cronjob better, as it gives other alerts a chance to come in and helps focus on specific alerts instead of picking up every random alert. I'm thinking of an hourly cron that looks at alerts generated at least an hour ago for which no backfill has been issued yet, ideally marking the alerts table to indicate that a backfill was issued.
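A minimal sketch of that hourly pass, under stated assumptions: alerts are plain dictionaries with hypothetical `created` and `backfill_issued` fields, and `issue_backfill` is a caller-supplied function standing in for whatever actually triggers the Taskcluster backfill action. None of these names come from Treeherder.

```python
from datetime import datetime, timedelta


def due_for_backfill(alerts, now, min_age=timedelta(hours=1)):
    """Alerts that are at least min_age old and have no backfill issued yet."""
    return [
        alert
        for alert in alerts
        if not alert["backfill_issued"] and now - alert["created"] >= min_age
    ]


def run_hourly(alerts, now, issue_backfill):
    """Issue one backfill per due alert and mark the alert afterwards."""
    issued = []
    for alert in due_for_backfill(alerts, now):
        issue_backfill(alert)  # hypothetical hook; may raise on failure
        # Mark only after the call returned, so a failed request is retried
        # on the next run instead of being silently skipped.
        alert["backfill_issued"] = True
        issued.append(alert["id"])
    return issued
```

Because the flag is checked before every request, running the job twice does not backfill the same alert twice, which also addresses the double-backfill question raised above.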

A few exceptions I can think of which will cause missing data and require manual backfilling/work:

  • backfill job fails to execute (timed out, infra, bad params)
  • builds/jobs fail to execute, net result no data

Before doing this, it would be useful to have metrics on the number of jobs executed as backfills; then we could track the cost of backfilling and the efficiency of a script doing it vs a human.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #1)

> I think we can limit this to:
>
>   • regressions only (not improvements)
>   • all platforms, but maybe limit android to 100/day
>   • 5%+ regressions only
>   • pick up to 3 jobs to backfill from a given summary
>
> I think it wouldn't be hard to execute a Taskcluster backfill action for a given job/revision from a cronjob or from inside of Treeherder when we generate the alerts. I like the cronjob better, as it gives other alerts a chance to come in and helps focus on specific alerts instead of picking up every random alert. I'm thinking of an hourly cron that looks at alerts generated at least an hour ago for which no backfill has been issued yet, ideally marking the alerts table to indicate that a backfill was issued.

These are all excellent suggestions, thanks Joel!

> Before doing this, it would be useful to have metrics on the number of jobs executed as backfills; then we could track the cost of backfilling and the efficiency of a script doing it vs a human.

Agreed, we've already started to look into this. I'm going to move these tasks out of Jira and into Bugzilla for better visibility and collaboration.

Priority: -- → P3
Assignee: nobody → igoldan
Status: NEW → ASSIGNED
Priority: P3 → P1
Depends on: 1570225
Depends on: 1570226
Keywords: meta
Summary: Automate backfill and retriggers when regressions are detected → [meta] Automate backfill and retriggers when regressions are detected
Depends on: 1570952
Depends on: 1571363
Depends on: 1571366
Depends on: 1571369
Depends on: 1571372
No longer depends on: 1571369
No longer depends on: 1571366
Summary: [meta] Automate backfill and retriggers when regressions are detected → [meta] Automate backfill and retrigger when regressions are detected
No longer depends on: 1571372
No longer depends on: 1571363
Assignee: igoldan → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
Summary: [meta] Automate backfill and retrigger when regressions are detected → [meta] Automate backfills and retriggers for alerts