Closed Bug 1613925 Opened 4 years ago Closed 3 years ago

Prepare credentials for backfill bot

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: mtabara)

Attachments

(1 file, 1 obsolete file)

Bug 1613925 - add Treeherder Sheriffs bot client. r=#releng 3 years ago Mihai Tabara [:mtabara]⌚️GMT 48 bytes, text/x-phabricator-request
Bug 1613925 - address reviewing comments. r=#releng 3 years ago Mihai Tabara [:mtabara]⌚️GMT 48 bytes, text/x-phabricator-request

Ionuț Goldan [:igoldan]

Reporter

Description

•

4 years ago

•

Edited

The bot we plan to define in bug 1612547 will definitely require Taskcluster credentials; otherwise, it won't be able to request any backfills for our perf jobs.

At the moment, we're able to simulate this ability using the clientId & access token from our own Taskcluster accounts. But we agree this is not ideal.

We need to create dedicated credentials for the bot. Currently, I'm not sure which of these two approaches is best:

1. Should we request that one of our current Taskcluster accounts be granted the ability to create client credentials? (I'm referring to accounts with Perf sheriff-specific scopes.)
2. Or should we request a standalone client credential (via a Bugzilla bug, maybe)?

Ionuț Goldan [:igoldan]

Reporter

Comment 1

•

4 years ago

Dustin, could I grab an answer from you on this?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 2

•

4 years ago

This is a better question for release engineering. I suspect your first approach is closer to the desired process.

Flags: needinfo?(dustin) → needinfo?(mozilla)

Ionuț Goldan [:igoldan]

Reporter

Comment 3

•

4 years ago

Or maybe Jordan could answer this as well?

Flags: needinfo?(jlund)

Jordan Lund (:jlund)

Comment 4

•

4 years ago

(In reply to Ionuț Goldan [:igoldan] from comment #3)

Or maybe Jordan could answer this as well?

Ionuț Goldan: could you provide some context on what you are currently doing vs. what you intend to do with the bot? What will automatically be rerun? I'd like to get a sense of the net change in CI load this bot would bring. Tom can help you with creds/clients.

Flags: needinfo?(jlund)

Ionuț Goldan [:igoldan]

Reporter

Updated

•

4 years ago

Depends on: 1616263

Ionuț Goldan [:igoldan]

Reporter

Updated

•

4 years ago

Assignee: nobody → gmierz2

Ionuț Goldan [:igoldan]

Reporter

Comment 5

•

4 years ago

Greg, once you sync with Jordan, you can reassign this to me.

Tom Prince [:tomprince]

Comment 6

•

4 years ago

Removing ni? pending discussion of scope.

Flags: needinfo?(mozilla)

Ionuț Goldan [:igoldan]

Reporter

Comment 7

•

4 years ago

•

Edited

The automatic backfill algorithm is as follows.

For every performance alert summary, we cherry-pick a maximum of 5 alerts (even if that summary has 50 of them).
Then we generate a backfill record for each of the cherry-picked alerts.
A backfill record basically stores a max of 5 perf jobs, which correspond to the suspect time range for when a perf change happened (regression or improvement).
These jobs are used as guidelines for performing the actual backfill. We'll backfill between them.
E.g. for 5 jobs, we'll do 4 backfills. For 3 jobs, we'll do 2 backfills.
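
The counting rules above can be sketched in a few lines of Python. This is only an illustration of the caps and the "n jobs give n - 1 backfills" rule; the names `pick_alerts` and `backfill_count` are hypothetical, and taking the first 5 alerts is an assumption, since the selection criteria are not described here.

```python
from typing import List, Sequence

MAX_ALERTS_PER_SUMMARY = 5  # cap stated in the comment above
MAX_JOBS_PER_RECORD = 5     # cap stated in the comment above


def pick_alerts(alerts: Sequence[str]) -> List[str]:
    """Keep at most MAX_ALERTS_PER_SUMMARY alerts.

    Taking the first ones is an assumption made for illustration only.
    """
    return list(alerts[:MAX_ALERTS_PER_SUMMARY])


def backfill_count(job_count: int) -> int:
    """Backfills happen between consecutive jobs: n jobs give n - 1 backfills."""
    jobs = min(job_count, MAX_JOBS_PER_RECORD)
    return max(jobs - 1, 0)
```

For example, `backfill_count(5)` gives 4 and `backfill_count(3)` gives 2, matching the figures above.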

Moving on, I'm providing some rough cost estimations, in terms of tasks triggered by the automated backfills.

last week:    200 records   = 200 per week
last 2 weeks: 340 records   = 170 per week
last month:   660 records   = 165 per week
last quarter: 1,500 records = 125 per week
average:                    = 165 per week

each record equates to ~4 backfills
each backfill equates to 1 decision task + max 9 build tasks + 9 perf tasks

During a week, that would mean on average:
WEEKLY_TASK_AVERAGE = 165 records * 4 backfills * (1 decision task + 9 build tasks + 9 perf tasks)

Ignoring the decision tasks, since they execute quickly, gives us:

WEEKLY_TASK_AVERAGE = 660 * 9 build tasks + 660 * 9 perf tasks
= 5,940 build tasks + 5,940 perf tasks

each build task takes ~30 minutes to finish (but I believe Jordan or someone from the build team has a better estimate for this)
each perf task takes ~20 minutes to finish
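
As a rough cross-check, the estimate can be reproduced in Python. This is a sketch using only the figures quoted in this comment; the function name `weekly_estimate` is hypothetical, and the month and quarter are assumed to be 4 and 12 weeks, which is what the per-week rates imply.

```python
# Weekly record rates quoted in this comment (records per week).
# Assumes a 4-week month and a 12-week quarter.
WEEKLY_RATES = [200, 340 / 2, 660 / 4, 1500 / 12]

BACKFILLS_PER_RECORD = 4       # "~4 backfills" per record
BUILD_TASKS_PER_BACKFILL = 9   # upper bound ("max 9 build tasks")
PERF_TASKS_PER_BACKFILL = 9
BUILD_MINUTES_PER_TASK = 30    # rough estimate from this comment
PERF_MINUTES_PER_TASK = 20     # rough estimate from this comment


def weekly_estimate() -> dict:
    """Return the weekly task and machine-time estimate, ignoring decision tasks."""
    records = sum(WEEKLY_RATES) / len(WEEKLY_RATES)
    backfills = records * BACKFILLS_PER_RECORD
    build_tasks = backfills * BUILD_TASKS_PER_BACKFILL
    perf_tasks = backfills * PERF_TASKS_PER_BACKFILL
    return {
        "records": records,
        "backfills": backfills,
        "build_tasks": build_tasks,
        "perf_tasks": perf_tasks,
        "build_hours": build_tasks * BUILD_MINUTES_PER_TASK / 60,
        "perf_hours": perf_tasks * PERF_MINUTES_PER_TASK / 60,
    }
```

With these inputs, 165 records per week become 660 backfills, and 660 * 9 works out to 5,940 tasks of each kind, or roughly 2,970 build hours and 1,980 perf hours per week at the quoted durations.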

Joel Maher ( :jmaher ) (UTC -8)

Comment 8

•

4 years ago

Currently, builds are guaranteed to exist; if they do not exist, it is either an intermittent issue or they are broken, so your total time incurred seems great.

Can you cross reference this with actual work done by perf sheriffs in reality? I assume humans might be more efficient (which is ok), but it is good to know what we see for number of jobs.

questions:

Is this for a specific platform?
Can you explain what a backfill is?
Can you explain why the need for ~4 backfills?
are there no retriggers as part of this?

I only ask so I can help provide context to all in concrete terms without ambiguity.

Ionuț Goldan [:igoldan]

Reporter

Comment 9

•

4 years ago

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)

[...]
questions:

Is this for a specific platform?

My cost estimations weren't platform specific. But this is a good point. For 2020/Q1, we plan to backfill only on Linux.
Gut feeling: this would further reduce the above costs by ~50%.

Can you explain what a backfill is?

It's an AC(Bk) task.

Can you explain why the need for ~4 backfills?

That's the general suspect range perf sheriffs tend to backfill/retrigger.

are there no retriggers as part of this?

Nope. For 2020/Q1 we're only planning to add automated backfilling.

[...]

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

•

4 years ago

Thanks for the update. To clarify: for a given alert on Linux, the backfill bot would look 2 revisions forward in history and 2 revisions backward, and fill in the holes to get a 40-revision window of complete data.

On Linux, I count 120 jobs run for talos/raptor on autoland (I assume that is the scope of this; please correct me if not). These run every 10th push. Simple math dictates that if our volume is 700 commits/week, then:
max load = 120*700 = 84000

As we currently run every 10th push (but more frequently during slow times), I will say the default load is every 9th push:
default load = 120*700/9 = 9333

Doing a rough query, I see 1430 backfill jobs in total on autoland for January and February. I assume some are for unit tests as well, and this would be for all platforms; this helps validate your math.
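
The load figures above can be reproduced with a small helper. This is a sketch only: `perf_job_load` is a hypothetical name, and it treats commits and pushes interchangeably, as the estimate above does.

```python
def perf_job_load(jobs_per_push: int, pushes_per_week: int, every_nth_push: int = 1) -> int:
    """Weekly perf jobs if the full job set runs on every Nth push (rounded down)."""
    return jobs_per_push * pushes_per_week // every_nth_push


# Figures from this comment: 120 talos/raptor jobs on Linux, ~700 commits per week.
max_load = perf_job_load(120, 700)         # jobs on every push
default_load = perf_job_load(120, 700, 9)  # jobs on roughly every 9th push
```

Here `max_load` is 84000 and `default_load` is 9333, matching the numbers above.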

Jordan Lund (:jlund)

Comment 11

•

4 years ago

Can you cross reference this with actual work done by perf sheriffs in reality? I assume humans might be more efficient (which is ok), but it is good to know what we see for number of jobs.

This is the main thing I am interested in: what would the delta be between the manual sheriffing work and what this bot would do? At any rate, if this bot has dials that can make it more or less aggressive at backfilling, I don't want to block now and we can proceed. Feel free to reach out to Tom.

Ionuț Goldan [:igoldan]

Reporter

Comment 12

•

4 years ago

•

Edited

We're still working on the query script that shows the actual figures.
Currently, the perf sheriffs as a whole would roughly perform 3,000 build tasks + 15,000 perf tasks every week.
The delta (what the bot will perform) should be: 3,000 build tasks + 3,000 perf tasks every week.
The other 12,000 perf tasks will be retriggered manually by the perf sheriffs.

We do have dials that can make the bot more or less aggressive at backfilling.
We've expressed them as hardcoded limits. If the bot exceeds a limit, it won't backfill anymore.
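
To illustrate what such a dial could look like, here is a minimal sketch of a hardcoded budget that refuses further backfills once it is used up. The class name, field names, and the idea of counting per backfill are all assumptions for illustration; this is not the bot's actual code.

```python
from dataclasses import dataclass


@dataclass
class BackfillBudget:
    """Hypothetical hardcoded limit on how many backfills may be requested."""

    max_backfills: int  # the "dial": raise or lower to change aggressiveness
    used: int = 0

    def try_consume(self, count: int = 1) -> bool:
        """Reserve `count` backfills; return False if that would exceed the limit."""
        if self.used + count > self.max_backfills:
            return False
        self.used += count
        return True
```

A caller would check `budget.try_consume(4)` before requesting the 4 backfills of a record, and skip the record when it returns False.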

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

•

4 years ago

I mentioned in comment 8 that builds are going to exist, can you outline where builds won't exist?

Ionuț Goldan [:igoldan]

Reporter

Comment 14

•

4 years ago

•

Edited

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #13)

I mentioned in comment 8 that builds are going to exist, can you outline where builds won't exist?

Oh, I think I didn't read that properly. Cool! That means the bot's delta will be only the 3,000 perf tasks.

Jordan Lund (:jlund)

Comment 15

•

4 years ago

(In reply to Ionuț Goldan [:igoldan] from comment #12)

We're still working on the query script that shows the actual figures.
Currently, the perf sheriffs as a whole would roughly perform 3,000 build tasks + 15,000 perf tasks every week.
The delta (what the bot will perform) should be: 3,000 build tasks + 3,000 perf tasks every week.
The other 12,000 perf tasks will be retriggered manually by the perf sheriffs.

We do have dials that can make the bot more or less aggressive at backfilling.
We've expressed them as hardcoded limits. If the bot exceeds a limit, it won't backfill anymore.

To me, this reads as though we won't be adding much additional load; for the most part, we are merely offloading manual work that sheriffs do. If I have that right, it works for me. Feel free to get creds from Tom.

Ionuț Goldan [:igoldan]

Reporter

Comment 16

•

4 years ago

(In reply to Jordan Lund (:jlund) from comment #15)

[...]
To me, this reads as though we won't be adding much additional load; for the most part, we are merely offloading manual work that sheriffs do. If I have that right, it works for me. Feel free to get creds from Tom.

That's entirely correct.

Ionuț Goldan [:igoldan]

Reporter

Comment 17

•

4 years ago

Tom, could you help me set up the credentials? Out of the 2 options highlighted in comment 0, could we go with the 1st option, as Dustin hinted?
That is:

request that one of our current Taskcluster accounts be granted the ability to create client credentials (I'm referring to accounts with Perf sheriff-specific scopes).

Flags: needinfo?(mozilla)

Ionuț Goldan [:igoldan]

Reporter

Updated

•

4 years ago

Assignee: gmierz2 → igoldan

Alexandru Irimovici

Updated

•

4 years ago

Assignee: igoldan → airimovici

Ionuț Goldan [:igoldan]

Reporter

Updated

•

4 years ago

Assignee: airimovici → igoldan

Tom Prince [:tomprince]

Comment 18

•

4 years ago

For testing, individual credentials can be used. Once this is ready to be deployed, I can create the appropriate clients.

Flags: needinfo?(mozilla)

Ionuț Goldan [:igoldan]

Reporter

Updated

•

4 years ago

Assignee: igoldan → nobody

Priority: P2 → P3

Mihai Tabara [:mtabara]⌚️GMT

Assignee

Comment 19

•

3 years ago

Attached file Bug 1613925 - add Treeherder Sheriffs bot client. r=#releng

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → mtabara

Status: NEW → ASSIGNED

Mihai Tabara [:mtabara]⌚️GMT

Assignee

Comment 20

•

3 years ago

Attached file Bug 1613925 - address reviewing comments. r=#releng (obsolete)

Phabricator Automation

Updated

•

3 years ago

Attachment #9217204 - Attachment is obsolete: true

Pulsebot

Comment 21

•

3 years ago

Pushed by mtabara@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/63beab494973
add Treeherder Sheriffs bot client. r=releng-reviewers,jmaher

Mihai Tabara [:mtabara]⌚️GMT

Assignee

Comment 22

•

3 years ago

The bot landed and the credentials have been moved into Heroku env vars. I think we're done here, please reopen if I'm wrong.
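
For anyone wiring this up elsewhere, a minimal sketch of reading Taskcluster credentials from environment variables is below. The variable names are the conventional Taskcluster ones and are an assumption here; the names actually configured in Heroku are not stated in this bug. `load_taskcluster_options` is a hypothetical helper that uses only the standard library and builds the `rootUrl`/`credentials` options shape that Taskcluster clients expect.

```python
import os
from typing import Dict, Mapping

# Conventional Taskcluster variable names; assumed, not confirmed by this bug.
REQUIRED_VARS = (
    "TASKCLUSTER_ROOT_URL",
    "TASKCLUSTER_CLIENT_ID",
    "TASKCLUSTER_ACCESS_TOKEN",
)


def load_taskcluster_options(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Build client options from the environment, failing loudly if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "rootUrl": env["TASKCLUSTER_ROOT_URL"],
        "credentials": {
            "clientId": env["TASKCLUSTER_CLIENT_ID"],
            "accessToken": env["TASKCLUSTER_ACCESS_TOKEN"],
        },
    }
```

Failing at startup when a variable is missing keeps a misconfigured deployment from silently running without the bot's dedicated credentials.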

Mihai Tabara [:mtabara]⌚️GMT

Assignee

Updated

•

3 years ago

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

Resolution: --- → FIXED
