Open Bug 1562162 Opened 4 months ago Updated 20 days ago

Measure machine time of retriggered/backfilled jobs

Categories

(Tree Management :: Perfherder, task, P3)

Tracking

(Not tracked)

People

(Reporter: igoldan, Unassigned)

References

(Depends on 2 open bugs, Blocks 2 open bugs)

Details

(Keywords: meta)

Attachments

(1 file)

We need to know how much CPU time our retriggers & backfills consume, for the Talos, Raptor & AWSY frameworks.
By "our", I refer to us, Perf sheriffs. Explicitly, this means:

  • igoldan
  • bebe
  • marauder
  • alexandrui

Code sheriffs' activity should be excluded from this query, as they also use retriggers/backfills for intermittent-failure monitoring.

I believe the best way to define a query for this would be to use Mozilla's ActiveData recipes, which can be found here. Kyle Lahnakoski could assist you in adding it to the repo. Sarah Clements could provide more insight into the kind of data Treeherder stores, and Joel Maher could provide more insight into related Taskcluster data, in case we need to query that too, so we have all the pieces together. I believe the data is fragmented across these 2 projects.

An example would shed some light over why we need this.
Given an alert, let's say alert #20770, if we go to its graph and zoom in here, we'll be able to see multiple commits with data points stacked one on top of the other.
The associated Treeherder job view (which is this one) pretty clearly shows some of the commits which have multiple Rap(tp6-6) jobs. These are the jobs that produced the data we observe in the graph.
As each job is triggered by a specific user and records how much time it needed to run, we need to sum up these durations.

To this sum, we then need to add the backfill time (when you select a job, click the three dots at the bottom-left, then click "Backfill") of all backfills triggered by Perf sheriffs. It's likely some backfill jobs have other dependent jobs that need to run first. We should take those into account too, for extra precision.
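A minimal sketch of the aggregation described above, assuming job records already carry the triggering user and a run duration (the record fields here are hypothetical, not Treeherder's actual schema):

```python
from collections import defaultdict

# Perf sheriffs whose retrigger/backfill machine time we want to measure.
PERF_SHERIFFS = {"igoldan", "bebe", "marauder", "alexandrui"}

def machine_time_by_sheriff(jobs):
    """Sum run durations of retriggered/backfilled jobs per perf sheriff,
    excluding everyone else (e.g. code sheriffs)."""
    totals = defaultdict(int)
    for job in jobs:
        if job["user"] in PERF_SHERIFFS:
            totals[job["user"]] += job["duration_s"]
    return dict(totals)

jobs = [
    {"user": "igoldan", "duration_s": 600},
    {"user": "csabou", "duration_s": 300},   # code sheriff: excluded
    {"user": "igoldan", "duration_s": 300},
]
print(machine_time_by_sheriff(jobs))  # {'igoldan': 900}
```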

I don't understand the specifics of perf sheriffs vs. everyone else. Why not sum up all non-default perf jobs and break it down by user?

Assignee: nobody → ariakab

Hello!

This is my first attempt to write a query based on what I understood from the requirements and the information I could gather.

https://sql.telemetry.mozilla.org/queries/63517

This query sums the durations of the jobs related to a certain push, grouping them by user and framework.

I know that some aspects were not taken into consideration but I couldn't figure out the following:

  • how can I know which job is a retrigger, a backfill or a "default"
  • for backfills, how can I get the jobs that they depend on
Flags: needinfo?(jmaher)

I am confused by your output as talos says 4 minutes, but I see many minutes used up on these talos jobs:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=7641c71557dba7ab8f76d7d90ef332895d5f0998&searchStr=talos&selectedJob=248633605

I would expect talos to have significantly more than 4 minutes; maybe 4 minutes is the average or max value.

To answer your questions, they are both hard.

A 'default' job is something that is scheduled by default. You could download the task graph generated as part of the original decision task, parse that list, and determine what was scheduled by default. That would be the ultimate source.

Determining a backfill might be similar, in that there exists a decision-task action called 'backfill' that fills in jobs for many previous revisions. The key here is that the job is added after the build is completed. Looking at Taskcluster proper will give you more details.

For example here is a backfill:
https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=backfill&tochange=f4081c5e99bf3f47f82f70b09edc7b94cc243841&group_state=expanded

This backfills android debug wr3:
https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=test-android-em-7.0-x86_64%2Fopt-web-platform-tests-reftests-e10s-3&tochange=f4081c5e99bf3f47f82f70b09edc7b94cc243841

If you look in Treeherder, the previous pushes do not have decision tasks, but they have extra jobs. Those jobs are owned by the original push author, not the person who initiated the retrigger/backfill.

If you dig in deeper and look at the taskcluster details:
https://tools.taskcluster.net/groups/KY55bYPzQym_Vf25lPUTdA/tasks/Zgjug0bsQKy341e10-538g/details

you can see a dependency list, and that dependency list will name the build it depends on as well as the action task (backfill or retrigger). Now, clicking on the backfill job, you can see who scheduled it:
https://tools.taskcluster.net/groups/Atc0VTYHSkqkj6kAyFJ9RQ/tasks/A-_sVN39S3azLKAllY5rNg/details

In this case, csabou did :)
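The dependency lookup above can be sketched against the queue API. The endpoint URL below reflects the Taskcluster deployment of the time, and treating the `dependencies` field as "build plus action task" is an assumption to verify:

```python
import json
from urllib.request import urlopen

# Queue endpoint as deployed at the time (assumption; may have moved since).
QUEUE = "https://queue.taskcluster.net/v1"

def task_definition(task_id):
    """Fetch a task's full definition from the Taskcluster queue."""
    with urlopen(f"{QUEUE}/task/{task_id}") as resp:
        return json.load(resp)

def dependency_ids(task_def):
    """Task ids this task depends on: typically the build it needs,
    plus the action task (backfill or retrigger) that scheduled it."""
    return task_def.get("dependencies", [])

# e.g. dependency_ids(task_definition("Zgjug0bsQKy341e10-538g"))
```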

I think the next logical question is how to get this information programmatically. I am not sure whether the Taskcluster information is easy to access via re-dash; maybe there is additional information in the Treeherder tables indicating a backfill or retrigger request. As it stands, I am not that familiar with the table structures.

In the past I have done something similar with an active-data-recipe:
https://github.com/jmaher/active-data-recipes/blob/master/recipes/seta_accuracy.py

This is limited in the sense that it takes the dependent build and assumes we schedule jobs within a really short time window after the build finishes. While that is true most of the time, what if a build is scheduled as a result of a backfill? That recipe doesn't distinguish between retriggers and backfills.

If you find that there is information in Taskcluster but not in re-dash or ActiveData, then please ask that the Taskcluster data be added to ActiveData; we already ingest most of it, so it is probably just a matter of ingesting some additional fields.
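If the data does land in ActiveData, a query might look roughly like the following. The table and field names here are assumptions and would need to be checked against ActiveData's actual "task" schema:

```python
import json

def backfill_time_query(user_email):
    """Build a hypothetical ActiveData JSON query summing task run time
    for backfills/retriggers requested by one user. All field names are
    assumptions to verify against the real schema."""
    return {
        "from": "task",
        "select": {"value": "task.run.duration", "aggregate": "sum"},
        "where": {"and": [
            {"eq": {"action.request.user": user_email}},          # assumed field
            {"in": {"action.name": ["backfill", "retrigger"]}},   # assumed field
        ]},
    }

payload = json.dumps(backfill_time_query("sheriff@mozilla.com"))
# POST payload to the ActiveData /query endpoint of the time.
```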

Flags: needinfo?(jmaher)
Status: NEW → ASSIGNED
Priority: -- → P1

I'm not a big power user of Treeherder and I'm also pretty new to the project. The UI is still confusing me and I clearly need a better understanding of both Treeherder and Taskcluster in order to achieve the purpose of this task.

The information provided so far is not quite clear to me and leaves me guessing how things might happen rather than understanding how they really happen, which is really inefficient and mentally exhausting.

I could really use a call with somebody who knows how Treeherder and Taskcluster work, rather than trying to understand the tools by reading comments and getting confused by certain Treeherder/Taskcluster URLs.

Otherwise, maybe there is someone else that can easily tackle this task.

Flags: needinfo?(jmaher)

:davehunt, can you walk Arnold through the Treeherder UI and viewing logs/tasks, to get him up to speed on things? If there are more detailed data-access questions, I am happy to meet and do a brain dump.

Flags: needinfo?(jmaher) → needinfo?(dave.hunt)

Kyle: Would you be able to help with an ActiveData query to identify the amount of time currently spent on retriggers and backfills? These can be separate queries, as it sounds like determining the cost of backfills is more complex. If we don't have the necessary data, is this something we can add, so that we'll be able to quantify this going forward?

Flags: needinfo?(dave.hunt) → needinfo?(klahnakoski)

:davehunt - I would need to know more about the process of retriggers. Ask :camd or :armenzg, or :ahal to find out who knows more about the retrigger logic: How do retrigger jobs get scheduled? What markup do retrigger jobs have? (how can we distinguish them from other jobs?) I am sure there is a property value in Treeherder, or Taskcluster, that allows us to recognize backfills; we just need to find it.

I did a weak attempt at this with arno__ on a call the other week, but could not find such a property.

Flags: needinfo?(klahnakoski)

(In reply to Kyle Lahnakoski [:ekyle] from comment #7)

:davehunt - I would need to know more about the process of retriggers. Ask :camd or :armenzg, or :ahal to find out who knows more about the retrigger logic: How do retrigger jobs get scheduled? What markup do retrigger jobs have? (how can we distinguish them from other jobs?) I am sure there is a property value in Treeherder, or Taskcluster, that allows us to recognize backfills; we just need to find it.

I did a weak attempt at this with arno__ on a call the other week, but could not find such a property.

Hi Kyle. Based on Joel's guidelines from comment 3, I've poked around Treeherder's UI and came to the same conclusion he did.
I only needed to run SQL queries on Treeherder's database and play around with Taskcluster's UI to check that.

Both perf jobs & retrigger/backfill jobs have associated Taskcluster ids, via the taskcluster_metadata table.

For retrigger jobs (such as AC(rt)), there's a parent-child relation, where the retrigger job acts as the parent and the perf job as the child, linked via the associated Taskcluster ids. Taskcluster is the service that keeps the record of these relations.

We're able to find all perf jobs and all retrigger/backfill jobs. But we're missing the relation records Taskcluster has; these would allow us to filter only the perf jobs which got retriggered or backfilled.

To me, this sounds like ActiveData recipes remain the way to perform this query.
Another approach would be to simply link the Taskcluster dependency id to Treeherder's jobs, but I'm not sure about the implementation details. :camd, does this sound like an easy implementation task? Maybe adding a table column and updating an ingestion worker?

Either way, I want to split this task in two, for a start.

Flags: needinfo?(cdawson)
Depends on: 1569165
Keywords: meta
Depends on: 1569166

Ionut: I was thinking of this very thing. The job table is huge, so the migration to add a field would be significant. But this would be a good thing to do. So we may want to take that hit and do it over a weekend sometime.

Currently, I added a Treeherder API called /taskclustermetadata/ where you can pass in a comma-separated list of job ids and it will give the task_id and retry_id for each.

But, yes, you're correct that we could add the fields to the job table, then update the ingestion task to populate it. We would also run a SET query to backfill the existing job fields.

We may even have had some other data migrations we wanted to do on the job table at the same time; I'd have to check out Bugzilla to see if we can find them. Looking at the DB, I think we can remove the project_specific_id and running_eta fields.
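The add-a-column-then-backfill migration described above can be illustrated on a toy SQLite database. Treeherder's real database is MySQL and the job table is far larger, so the production migration would need chunking and care about lock time; this only shows the shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO job (name) VALUES ('Rap(tp6-6)')")

# 1. Add the new column (nullable, so the ALTER is cheap).
conn.execute("ALTER TABLE job ADD COLUMN task_id TEXT")

# 2. Backfill existing rows with a SET query. In production this would
#    join against the taskcluster_metadata table rather than use a
#    placeholder value.
conn.execute("UPDATE job SET task_id = 'unknown' WHERE task_id IS NULL")

row = conn.execute("SELECT task_id FROM job").fetchone()
print(row)  # ('unknown',)
```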

Flags: needinfo?(cdawson)

camd: I thought one of the features of an ORM is that it provides migration tools: the ability to alter table definitions on the fly while keeping the database accessible. Essentially, implement the following instructions:

  1. Ensure audit trail is enabled for given tables
  2. Create new empty table(s) with new schema
  3. Ensure audit trail is used to keep new table(s) up to date (on new changes only)
  4. Copy historical records from old tables to new table(s) in small chunks
  5. Rename tables, update constraints

An audit trail is a set of tables and triggers that tracks all changes to a database; I built a couple (in the form of a stored procedure that will attach the requisite triggers to all the tables) for credit card startups to ensure there is an actual audit trail. They come in different forms, maybe the internet has one we can drop in.
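Step 4 above (copying historical records in small chunks) might look like this sketch, with illustrative table names on SQLite; the chunk size in production would of course be far larger:

```python
import sqlite3

def copy_in_chunks(conn, chunk=2):
    """Copy rows from old_job to new_job in small, keyed batches so the
    database stays responsive while the copy runs."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM old_job WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk)).fetchall()
        if not rows:
            break
        conn.executemany("INSERT INTO new_job (id, name) VALUES (?, ?)", rows)
        last_id = rows[-1][0]  # resume after the last copied key

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_job (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE new_job (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO old_job VALUES (?, ?)",
                 [(i, f"job{i}") for i in range(1, 6)])

copy_in_chunks(conn)
count = conn.execute("SELECT COUNT(*) FROM new_job").fetchone()[0]
print(count)  # 5
```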

Could we not just look at pushes that have a related decision task job ('rt', 'add-new', 'bk', etc.), then analyze further for jobs whose scheduled time > decision task time + overhead (10 minutes)?
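That heuristic could be sketched as a simple filter; all field names here are hypothetical:

```python
from datetime import datetime, timedelta

# Jobs scheduled well after the decision task (plus a 10-minute overhead
# window) are candidates for having been retriggered or backfilled.
OVERHEAD = timedelta(minutes=10)

def late_scheduled_jobs(decision_time, jobs):
    cutoff = decision_time + OVERHEAD
    return [j["name"] for j in jobs if j["scheduled"] > cutoff]

decision = datetime(2019, 7, 1, 12, 0)
jobs = [
    {"name": "Rap(tp6-6)", "scheduled": datetime(2019, 7, 1, 12, 5)},
    {"name": "Rap(tp6-6) retrigger", "scheduled": datetime(2019, 7, 1, 14, 0)},
]
print(late_scheduled_jobs(decision, jobs))  # ['Rap(tp6-6) retrigger']
```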

:igoldan What property (or properties) are used to identify a retrigger? Even if this is associated with one task, we can accumulate all the other tasks in the group to get the total machine time.

(In reply to Cameron Dawson [:camd] from comment #9)

Ionut: I was thinking of this very thing. The job table is huge, so the migration to add a field would be significant. But this would be a good thing to do. So we may want to take that hit and do it over a weekend sometime.

Currently, I added a Treeherder API called /taskclustermetadata/ where you can pass in a comma-separated list of job ids and it will give the task_id and retry_id for each.

But, yes, you're correct that we could add the fields to the job table, then update the ingestion task to populate it. We would also run a SET query to backfill the existing job fields.

We may even have had some other data migrations we wanted to do to the job table at the same time. I'd have to checkout bugzilla to see if we can find them. Looking at the DB, I think we can remove project_specific_id and running_eta fields.

I believe I didn't express myself too well, as this isn't what I had in mind. My intention is to extend the taskcluster_metadata table from Treeherder to include another column named dependency_task_id. I've attached a screenshot with some notes, so you can better picture this column's purpose.
For example, this AC(rt) has Task: Rr-HYROnQGqEa8saMbmzOg. It's a dependency for this T(c) task, as it scheduled it to run. Basically, T(c) should have Task: F-BzaHsWTlWOs-bkx2xe4Q and Dependency task: Rr-HYROnQGqEa8saMbmzOg.

My main question on that is: can we change the existing endpoint that's providing the task_id & retry_id to also provide the dependency_task_id?

Flags: needinfo?(cdawson)

(In reply to Kyle Lahnakoski [:ekyle] from comment #12)

:igoldan What property (or properties) are used to identify a retrigger? Even if this is associated with one task, we can accumulate all the other tasks in the group to get the total machine time.

At the moment, the taskcluster_metadata.task_id associated with a job is the starting point. It allows us to query Taskcluster with that task_id and get the task_ids of its dependencies. If among those dependencies we find an AC(rt) job, then the job was retriggered.
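As a pure-function sketch of that check, with an illustrative dependency map standing in for real queue API responses (the action-task labels are made up):

```python
# A job counts as retriggered/backfilled when one of its Taskcluster
# dependencies is an action task. In reality deps/labels would be built
# from queue API lookups; here they are hand-written for illustration.
ACTION_LABELS = {"action: retrigger", "action: backfill"}

def was_retriggered_or_backfilled(task_id, deps, labels):
    """deps: task_id -> list of dependency task ids;
    labels: task_id -> human-readable task label."""
    return any(labels.get(d) in ACTION_LABELS for d in deps.get(task_id, []))

deps = {"T1": ["B1", "A1"], "T2": ["B1"]}
labels = {"B1": "build-android", "A1": "action: retrigger"}
print(was_retriggered_or_backfilled("T1", deps, labels))  # True
print(was_retriggered_or_backfilled("T2", deps, labels))  # False
```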

(In reply to Ionuț Goldan [:igoldan], Performance Sheriff from comment #13)

(In reply to Cameron Dawson [:camd] from comment #9)

Ionut: I was thinking of this very thing. The job table is huge, so the migration to add a field would be significant. But this would be a good thing to do. So we may want to take that hit and do it over a weekend sometime.

Currently, I added a Treeherder API called /taskclustermetadata/ where you can pass in a comma-separated list of job ids and it will give the task_id and retry_id for each.

But, yes, you're correct that we could add the fields to the job table, then update the ingestion task to populate it. We would also run a SET query to backfill the existing job fields.

We may even have had some other data migrations we wanted to do to the job table at the same time. I'd have to checkout bugzilla to see if we can find them. Looking at the DB, I think we can remove project_specific_id and running_eta fields.

I believe I didn't express myself too well, as this isn't what I had in mind. My intention is to extend the taskcluster_metadata table from Treeherder to include another column named dependency_task_id. I've attached a screenshot with some notes, so you can better picture this column's purpose.
For example, this AC(rt) has Task: Rr-HYROnQGqEa8saMbmzOg. It's a dependency for this T(c) task, as it scheduled it to run. Basically, T(c) should have Task: F-BzaHsWTlWOs-bkx2xe4Q and Dependency task: Rr-HYROnQGqEa8saMbmzOg.

My main question on that is: can we change the existing endpoint that's providing the task_id & retry_id to also provide the dependency_task_id?

We could certainly change the API to return that as well. But I'll admit, I don't know how to find that. If it only comes during task ingestion, then, as you say, we'll need to add a field to that metadata table. Same issue: it's fine to do so, but that's a large table (with few fields), so it may be a slow migration. I'm not certain HOW slow with our current database engine; maybe it's not so bad. :) Regardless, I have no objections to doing this. Make it so! :)
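A sketch of what the extended response could look like; the data shapes below are assumptions, not Treeherder's actual /taskclustermetadata/ serializer:

```python
def taskcluster_metadata_response(job_ids, metadata):
    """For each job id, return task_id and retry_id as the endpoint does
    today, plus the proposed dependency_task_id column (None until the
    new column is populated)."""
    return [
        {
            "job_id": jid,
            "task_id": metadata[jid]["task_id"],
            "retry_id": metadata[jid]["retry_id"],
            "dependency_task_id": metadata[jid].get("dependency_task_id"),
        }
        for jid in job_ids if jid in metadata
    ]

# Example using the task ids from the comment above.
metadata = {
    42: {"task_id": "F-BzaHsWTlWOs-bkx2xe4Q", "retry_id": 0,
         "dependency_task_id": "Rr-HYROnQGqEa8saMbmzOg"},
}
print(taskcluster_metadata_response([42], metadata))
```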

Flags: needinfo?(cdawson)
Assignee: ariakab → nobody
Status: ASSIGNED → NEW
Summary: Measure current machine time of retriggered/backfilled jobs → Measure machine time of retriggered/backfilled jobs
Priority: P1 → P3