Bug 1562178 Comment 0 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

Ionuț Goldan [:igoldan]

on 2019-06-28 04:14:36 PDT

I think there are multiple approaches to tackle this. I, for one, propose the following one, as it is close to how Code sheriffs supported us, Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering & backfilling limit, established per day let's say, specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on next cron run.

Tricky part of this: alert summaries could very likely contain many, many alerts. R/b-ing all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5 let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the **Retrigger/backfill [2]** section, which some extra mentionings.

The cronjob will target Raptor, Talos & AWSY test frameworks.
It should do a max of 300 retriggers/day (that's the limit a Code sheriff shift had in total)
It should have a special algorithm for picking which alerts to choose from.
This algorithm should consider these priorities, in this precise order:
* regression > improvement
* Windows 10 > Windows 7 > Linux > OSX > Android
* percentage magnitude *(how serious a particular alert is)*
* ideally, each selected alert should originate from a different platform

Revision 1 by

Ionuț Goldan [:igoldan]

on 2019-06-28 04:16:50 PDT

I think there are multiple approaches to tackle this. I, for one, propose the following one, as it is close to how Code sheriffs supported & are still supporting us, Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering & backfilling limit, established per day let's say, specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on next cron run.

Tricky part of this: alert summaries could very likely contain many, many alerts. R/b-ing all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5 let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the **Retrigger/backfill [2]** section, which some extra mentionings.

The cronjob will target Raptor, Talos & AWSY test frameworks.
It should do a max of 300 retriggers/day (that's the limit a Code sheriff shift had in total)
It should have a special algorithm for picking which alerts to choose from.
This algorithm should consider these priorities, in this precise order:
* regression > improvement
* Windows 10 > Windows 7 > Linux > OSX > Android
* percentage magnitude *(how serious a particular alert is)*
* ideally, each selected alert should originate from a different platform

Revision 2 by

Ionuț Goldan [:igoldan]

on 2019-06-28 04:17:51 PDT

I think there are multiple approaches to tackle this. I, for one, propose the following one, as it is close to how Code sheriffs supported & are still supporting us, Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering & backfilling limit, established per day let's say, specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on next cron run.

Tricky part of this: alert summaries could very likely contain many, many alerts. R/b-ing all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5 let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the **Retrigger/backfill [2]** section, which some extra mentionings.

The cronjob will target Raptor, Talos & AWSY test frameworks.
It should do a max of 300 retriggers/day *(that's the limit a full Code sheriff shift had in total)*
It should have a special algorithm for picking which alerts to choose from.
This algorithm should consider these priorities, in this precise order:
* regression > improvement
* Windows 10 > Windows 7 > Linux > OSX > Android
* percentage magnitude *(how serious a particular alert is)*
* ideally, each selected alert should originate from a different platform

Revision 3 by

Ionuț Goldan [:igoldan]

on 2019-06-28 04:26:20 PDT

I think there are multiple approaches to tackle this. I, for one, propose the following one, as it is close to how Code sheriffs supported & are still supporting us, Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering & backfilling limit, established per day let's say, specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on next cron run.

Tricky part of this: alert summaries could very likely contain many, many alerts. R/b-ing all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5 let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the **Retrigger/backfill [2]** section, with some extra mentionings.

The cronjob will target Raptor, Talos & AWSY test frameworks.
It should do a max of 300 retriggers/day *(that's the limit a full Code sheriff shift had in total)*
It should have a special algorithm for picking which alerts to choose from.
This algorithm should consider these priorities, in this precise order:
* regression > improvement
* Windows 10 > Windows 7 > Linux > OSX > Android
* percentage magnitude *(how serious a particular alert is)*
* ideally, each selected alert should originate from a different platform

Revision 4 by

Ionuț Goldan [:igoldan]

on 2019-06-28 04:27:55 PDT

I think there are multiple approaches to tackle this. I, for one, propose the following one, as it is close to how Code sheriffs supported & are still supporting us, Perf sheriffs.

Define & configure a cronjob which runs every 2 hours. It should identify all new alert summaries and do some retriggering/backfilling on them. It should have a retriggering & backfilling limit, established per day let's say, specified somewhere in the settings.py module.

Summaries which have been handled this way will be skipped on next cron run.

Tricky part of this: alert summaries could very likely contain many, many alerts. R/b-ing all of them would rapidly deplete that limit, so the cronjob should be a bit smarter than this.

It should know how to pick the most relevant alerts (a max of 5 let's say) and then retrigger them. More details on how to do that are provided in the attached document, under the **Retrigger/backfill [2]** section, with some additional notes.

The cronjob will target Raptor, Talos & AWSY test frameworks.
It should do a max of 600 retriggers per platform per day *(that's the limit a full Code sheriff shift had in total)*
It should have a special algorithm for picking which alerts to choose from.
This algorithm should consider these priorities, in this precise order:
* regression > improvement
* Windows 10 > Windows 7 > Linux > OSX > Android
* percentage magnitude *(how serious a particular alert is)*
* ideally, each selected alert should originate from a different platform
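
To make the priority list above concrete, here is a minimal sketch of how such a picker could look. It is only one possible reading of the proposal, not an agreed implementation: the `Alert` class, its field names, the platform strings such as `windows10-64`, and the helper names are all hypothetical. It also assumes that regressions always come before improvements, and that the "different platform" preference is applied inside each of those two groups.

```python
from dataclasses import dataclass

# Platform ranking from the comment; a lower index means a higher priority.
# The prefixes themselves are assumed, real platform names may differ.
PLATFORM_PRIORITY = ["windows10", "windows7", "linux", "osx", "android"]
MAX_ALERTS_PER_SUMMARY = 5  # "a max of 5 let's say"


@dataclass(frozen=True)
class Alert:
    """Hypothetical stand-in for a perf alert; not the real model."""
    id: int
    is_regression: bool
    platform: str
    pct_change: float  # percentage magnitude of the change


def platform_rank(platform):
    """Return the priority index of a platform; unknown platforms rank last."""
    for rank, prefix in enumerate(PLATFORM_PRIORITY):
        if platform.startswith(prefix):
            return rank
    return len(PLATFORM_PRIORITY)


def sort_key(alert):
    """Order by regression first, then platform priority, then magnitude."""
    return (
        not alert.is_regression,          # False sorts before True
        platform_rank(alert.platform),
        -abs(alert.pct_change),           # bigger changes first
    )


def spread_over_platforms(ranked):
    """Move the best alert of each platform ahead of same-platform repeats."""
    first, rest, seen = [], [], set()
    for alert in ranked:
        if alert.platform in seen:
            rest.append(alert)
        else:
            seen.add(alert.platform)
            first.append(alert)
    return first + rest


def pick_alerts(alerts, limit=MAX_ALERTS_PER_SUMMARY):
    """Pick at most `limit` alerts from one summary to retrigger/backfill."""
    ranked = sorted(alerts, key=sort_key)
    regressions = [a for a in ranked if a.is_regression]
    improvements = [a for a in ranked if not a.is_regression]
    ordered = spread_over_platforms(regressions) + spread_over_platforms(improvements)
    return ordered[:limit]
```

The cronjob would call something like `pick_alerts(summary_alerts)` for every new summary and retrigger only what comes back, which keeps the daily limit from being used up by a single large summary.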
