Closed Bug 1613925 Opened 4 years ago Closed 3 years ago

Prepare credentials for backfill bot

Categories

(Tree Management :: Perfherder, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: igoldan, Assigned: mtabara)

Attachments

(1 file, 1 obsolete file)

The bot we plan to define in bug 1612547 will definitely require Taskcluster credentials, otherwise it won't be able to request any backfills for our perf jobs.

At the moment, we're able to simulate this ability using the clientId & access token from our own Taskcluster accounts. But we agree this is not ideal.

We need to create dedicated credentials for the bot. Currently, I'm not sure which of these two approaches is best:

  • should we require one of our current Taskcluster accounts to be granted the ability to create client credentials? (I'm referring to accounts with Perf sheriff-specific scopes)
  • or should we require a standalone client credential (via a Bugzilla bug maybe)?

Dustin, could I grab an answer from you on this?

Flags: needinfo?(dustin)

This is a better question for release engineering. I suspect your first approach is closer to the desired process.

Flags: needinfo?(dustin) → needinfo?(mozilla)

Or maybe Jordan could also?

Flags: needinfo?(jlund)

(In reply to Ionuț Goldan [:igoldan] from comment #3)

Or maybe Jordan could also?

Ionuț Goldan: could you provide some context on what you are currently doing vs. what you intend to do with the bot? What will automatically be rerun? I'd like to get a sense of the net change in CI load this bot would bring. Tom can help you with creds/clients.

Flags: needinfo?(jlund)
Depends on: 1616263
Assignee: nobody → gmierz2

Greg, once you sync with Jordan, you can reassign this to me.

Removing ni? pending discussion of scope.

Flags: needinfo?(mozilla)

The automatic backfill algorithm is as follows.

For every performance alert summary, we cherry-pick a max of 5 alerts. (Even if that summary has 50 of them!)
Then, we generate a backfill record for each of the cherry picked alerts.
A backfill record basically stores a max of 5 perf jobs, which correspond to the suspect time range for when a perf change happened (regression or improvement).
These jobs are used as guidelines for performing the actual backfill. We'll backfill between them.
E.g. for 5 jobs, we'll do 4 backfills. For 3 jobs, we'll do 2 backfills.
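
To make the counting above concrete, here is a minimal sketch of the selection and record-building steps. It is not the bot's actual code: the names `BackfillRecord`, `pick_alerts` and `build_records` are hypothetical, and taking the first 5 alerts is only a placeholder, since the real cherry-picking criteria are not described in this bug.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_ALERTS_PER_SUMMARY = 5  # from the description: at most 5 alerts per summary
MAX_JOBS_PER_RECORD = 5     # from the description: at most 5 perf jobs per record


@dataclass
class BackfillRecord:
    """Hypothetical container: one record per cherry-picked alert."""
    alert_id: int
    job_ids: List[int]  # perf jobs bracketing the suspect range, oldest first

    def backfill_count(self) -> int:
        # We backfill between consecutive jobs, so n jobs give n - 1 backfills.
        return max(len(self.job_ids) - 1, 0)


def pick_alerts(alert_ids: List[int]) -> List[int]:
    # Placeholder selection (assumption): keep the first few alerts only.
    return alert_ids[:MAX_ALERTS_PER_SUMMARY]


def build_records(alert_ids: List[int],
                  jobs_by_alert: Dict[int, List[int]]) -> List[BackfillRecord]:
    records = []
    for alert_id in pick_alerts(alert_ids):
        # Cap the stored jobs at the per-record maximum.
        jobs = jobs_by_alert.get(alert_id, [])[:MAX_JOBS_PER_RECORD]
        records.append(BackfillRecord(alert_id, jobs))
    return records
```

With a summary of 50 alerts this yields 5 records, and a record holding 5 jobs reports 4 backfills.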

Moving on, I'm providing some rough cost estimations, in terms of tasks triggered by the automated backfills.

period          records    per week
last week           200         200
last 2 weeks        340         170
last month          660         165
last quarter      1,500         125
average                         165

each record would equate to ~4 backfills
each backfill would equate to 1 decision task + max 9 build tasks + 9 perf tasks

During a week, that would mean on average:
WEEKLY_TASK_AVERAGE = 165 records * 4 backfills * (1 decision task + 9 build tasks + 9 perf tasks)

Ignoring the decision tasks, as they are quick on exec time, gives us:

WEEKLY_TASK_AVERAGE = 660 * 9 build tasks + 660 * 9 perf tasks
                    = 5,940 build tasks + 5,940 perf tasks

each build task takes ~30 minutes to finish (but I believe Jordan or someone from the build team has a better estimate for this)
each perf task takes ~20 minutes to finish
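
For anyone who wants to re-run the estimate with different inputs, the arithmetic can be sketched as below. The constants are the rough figures from this comment (165 records/week, ~4 backfills per record, 9 build + 9 perf tasks per backfill, ~30 and ~20 minutes per task); the function name `weekly_estimate` is made up for illustration, and decision tasks are ignored as above.

```python
RECORDS_PER_WEEK = 165         # average from the table above
BACKFILLS_PER_RECORD = 4       # ~4 backfills per record
BUILD_TASKS_PER_BACKFILL = 9   # upper bound per backfill
PERF_TASKS_PER_BACKFILL = 9
BUILD_MINUTES = 30             # rough per-task duration
PERF_MINUTES = 20              # rough per-task duration


def weekly_estimate(records: int = RECORDS_PER_WEEK) -> dict:
    """Return weekly task counts and machine hours, ignoring decision tasks."""
    backfills = records * BACKFILLS_PER_RECORD
    build_tasks = backfills * BUILD_TASKS_PER_BACKFILL
    perf_tasks = backfills * PERF_TASKS_PER_BACKFILL
    return {
        "backfills": backfills,
        "build_tasks": build_tasks,
        "perf_tasks": perf_tasks,
        "build_hours": build_tasks * BUILD_MINUTES / 60,
        "perf_hours": perf_tasks * PERF_MINUTES / 60,
    }


if __name__ == "__main__":
    for key, value in weekly_estimate().items():
        print(f"{key}: {value}")
```

With the defaults this gives 660 backfills and 5,940 tasks of each kind per week, i.e. roughly 2,970 build hours and 1,980 perf hours as an upper bound.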

Currently builds are guaranteed to exist; if they do not exist, it is either an intermittent issue or they are broken. So your total time incurred seems too high.

Can you cross reference this with actual work done by perf sheriffs in reality? I assume humans might be more efficient (which is ok), but it is good to know what we see for number of jobs.

questions:

  1. Is this for a specific platform?
  2. Can you explain what a backfill is?
  3. Can you explain why the need for ~4 backfills?
  4. are there no retriggers as part of this?

I only ask so I can help provide context to all in concrete terms without ambiguity.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)

[...]
questions:

  1. Is this for a specific platform?

My cost estimations weren't platform specific. But this is a good point. For 2020/Q1, we plan to backfill only on Linux.
Gut feeling: this would further reduce the above costs by ~50%.

  2. Can you explain what a backfill is?

It's an AC(Bk) task.

  3. Can you explain why the need for ~4 backfills?

That's the general suspect range perf sheriffs tend to backfill/retrigger.

  4. are there no retriggers as part of this?

Nope. For 2020/Q1 we're only planning to add automated backfilling.

[...]

Thanks for the update. To clarify: for a given alert on Linux, the backfill bot would look 2 revisions forward in history and 2 revisions backwards, and fill in the holes to have a 40 revision window of complete data.

On Linux, I count 120 jobs run for talos/raptor (I assume that is the scope of this; please correct me if not) on autoland. These run every 10th push. Simple math dictates that if our volume is 700 commits/week then:
max load = 120*700 = 84000

As we currently run every 10th push (but more frequently during slow times), I will say the default load is every 9th push:
default load = 120*700/9 = 9333
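
The same load figures as a small sketch, so the push frequency can be varied. The constants are the counts quoted above (120 talos/raptor jobs on Linux, ~700 commits/week); `weekly_jobs` is a hypothetical helper name.

```python
JOBS_PER_PUSH = 120    # talos/raptor jobs counted on Linux
PUSHES_PER_WEEK = 700  # approximate commit volume on autoland


def weekly_jobs(every_nth_push: int) -> int:
    """Jobs per week if the perf jobs run on every n-th push (rounded down)."""
    if every_nth_push < 1:
        raise ValueError("every_nth_push must be >= 1")
    return JOBS_PER_PUSH * PUSHES_PER_WEEK // every_nth_push


if __name__ == "__main__":
    print("max load:", weekly_jobs(1))       # every push
    print("default load:", weekly_jobs(9))   # every 9th push
```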

Doing a rough query, I see 1430 backfill jobs in total on autoland for January and February. I assume some are for unittests as well and this would be for all platforms; this helps validate your math.

Can you cross reference this with actual work done by perf sheriffs in reality? I assume humans might be more efficient (which is ok), but it is good to know what we see for number of jobs.

This is the main thing I am interested in: what would be the delta between the sheriffs' manual work and what this bot would do? At any rate, if this bot has dials that can make it more or less aggressive at backfilling, I don't want to block now and we can proceed. Feel free to reach out to Tom.

We're still working on the query script that shows the actual figures.
Currently, the perf sheriffs as a whole would roughly perform 3,000 build tasks + 15,000 perf tasks every week.
The delta (the bot will perform) should be: 3,000 build tasks + 3,000 perf tasks every week.
The other 12,000 perf tasks will be retriggered manually by the perf sheriffs.

We do have dials that can be more and less aggressive at backfilling.
We've expressed them as hardcoded limits. If the bot exceeds a limit, it won't backfill anymore.
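
As an illustration only, a limit of this kind could look like the sketch below. The class name `BackfillBudget`, the per-period counter and the example cap are all assumptions made for the example; the bot's real limits and where they live are not described in this bug.

```python
class BackfillBudget:
    """Hypothetical hard cap on how many backfills may be requested per period."""

    def __init__(self, max_backfills: int) -> None:
        self.max_backfills = max_backfills
        self.used = 0

    def try_consume(self, count: int = 1) -> bool:
        # Refuse the request if it would push us past the cap.
        if self.used + count > self.max_backfills:
            return False
        self.used += count
        return True


if __name__ == "__main__":
    budget = BackfillBudget(max_backfills=5)  # example value, not a real limit
    for wanted in (4, 2, 1):
        status = "backfilling" if budget.try_consume(wanted) else "limit reached, skipping"
        print(f"{wanted} backfills requested: {status}")
```

Turning the dial up or down then amounts to changing the cap.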

I mentioned in comment 8 that builds are going to exist, can you outline where builds won't exist?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #13)

I mentioned in comment 8 that builds are going to exist, can you outline where builds won't exist?

Oh, I think I didn't read that properly. Cool! That means the bot's delta will only perform 3,000 perf tasks.

(In reply to Ionuț Goldan [:igoldan] from comment #12)

We're still working on the query script that shows the actual figures.
Currently, the perf sheriffs as a whole would roughly perform 3,000 build tasks + 15,000 perf tasks every week.
The delta (the bot will perform) should be: 3,000 build tasks + 3,000 perf tasks every week.
The other 12,000 perf tasks will be retriggered manually by the perf sheriffs.

We do have dials that can be more and less aggressive at backfilling.
We've expressed them as hardcoded limits. If the bot exceeds a limit, it won't backfill anymore.

This reads to me as though we won't be adding much additional load, merely offloading manual work that sheriffs do, for the most part. If I have that right, it works for me. Feel free to get creds from Tom.

(In reply to Jordan Lund (:jlund) from comment #15)

[...]
This reads to me as though we won't be adding much additional load, merely offloading manual work that sheriffs do, for the most part. If I have that right, it works for me. Feel free to get creds from Tom.

That's entirely correct.

Tom, could you help me set up the credentials? Out of the 2 options highlighted in comment 0, could we go with the 1st option, as Dustin hinted?
That is:

require one of our current Taskcluster accounts to be granted the ability to create client credentials? (I'm referring to accounts with Perf sheriff-specific scopes)

Flags: needinfo?(mozilla)
Assignee: gmierz2 → igoldan
Assignee: igoldan → airimovici
Assignee: airimovici → igoldan

For testing, individual credentials can be used. Once this is ready to be deployed, I can create the appropriate clients.

Flags: needinfo?(mozilla)
Assignee: igoldan → nobody
Priority: P2 → P3
Assignee: nobody → mtabara
Status: NEW → ASSIGNED
Attachment #9217204 - Attachment is obsolete: true
Pushed by mtabara@mozilla.com:
https://hg.mozilla.org/ci/ci-configuration/rev/63beab494973
add Treeherder Sheriffs bot client. r=releng-reviewers,jmaher

The bot landed and the credentials have been moved into Heroku env vars. I think we're done here, please reopen if I'm wrong.
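
For reference, reading such credentials from the environment could be sketched roughly as follows. The variable names `TASKCLUSTER_CLIENT_ID` and `TASKCLUSTER_ACCESS_TOKEN` are an assumption here (the names actually configured in Heroku are not stated in this bug), and `load_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; the deployed configuration may use different ones.
CLIENT_ID_VAR = "TASKCLUSTER_CLIENT_ID"
ACCESS_TOKEN_VAR = "TASKCLUSTER_ACCESS_TOKEN"


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the bot's client credentials from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in (CLIENT_ID_VAR, ACCESS_TOKEN_VAR) if not env.get(name)]
    if missing:
        # Fail early instead of sending unauthenticated backfill requests.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "clientId": env[CLIENT_ID_VAR],
        "accessToken": env[ACCESS_TOKEN_VAR],
    }
```

Keeping the secrets in env vars means they stay out of the repository, and a missing variable shows up as an explicit error at startup.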

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED