Add Perfherder to sheriff bug information to BigQuery
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: ccd, Assigned: trink)
References
Details
In the Treeherder MySQL instance there is information relating to regression alerts and the corresponding sheriff activity. This information is missing from the moz-fx-data-derived-datasets.taskclusteretl dataset and its perfherder table. The request is to add this information to the BigQuery dataset.
This query shows a use case.
The missing MySQL tables are:
- performance_alert
- performance_alert_summary
Updated•6 years ago
Comment 1•6 years ago
This query would also need:
- performance_signature
- performance_framework
- repository
- machine_platform
For untangling performance_signature, these fields exist:
- option_collection_id, which would need the option_collection table and the option table
For performance_alert_summary, these fields exist:
- assignee_id - not used, based on a code search
- issue_tracker_id - needs the issue_tracker database
- prev_push_id / push_id - needs the table: push
For repository, we need:
- repository_group_id - uses table: repository_group
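The foreign-key chain above could be collapsed into a single join. This is only a sketch: the column names (e.g. series_signature_id, platform_id) are assumptions based on Treeherder's usual naming conventions, not verified against the live schema.

```sql
-- Sketch only: assumed Treeherder foreign-key column names.
SELECT alert.*,
       summary.push_id,
       sig.suite,
       framework.name AS framework,
       repo.name      AS repository,
       platform.platform
FROM performance_alert AS alert
INNER JOIN performance_alert_summary AS summary ON alert.summary_id = summary.id
INNER JOIN performance_signature AS sig ON alert.series_signature_id = sig.id
INNER JOIN performance_framework AS framework ON summary.framework_id = framework.id
INNER JOIN repository AS repo ON summary.repository_id = repo.id
INNER JOIN machine_platform AS platform ON sig.platform_id = platform.id;
```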
I am not sure how the data is stored now or how we will relate performance_datum -> performance_signature & performance_alert.
:trink, is what I am outlining here useful? Can you comment on what you think we need to support these additional tables and their related data?
Comment 2•6 years ago
Not all the data requested has to be extracted from Treeherder; BigQuery already has most of the requested metadata. What is really needed is a way to correlate the alert back to a specific task and down to a specific perfherder entry within that task. From what I am seeing, the Treeherder schema was not designed for this type of mapping; with a series of queries I can get close. I am hoping I am missing something and we can close the gap. I will use alert id 92312 to illustrate the issue:
- Get the alert:
  select * from performance_alert where id = 92312
  -- 92,312 0 4 4.13 -18.34 443.78 425.44 1,000.00 2,056,486 25,523 1 2,215 0 2020-04-01 13:42 2020-04-01 13:42 2020-04-01 13:45
- Get the alert summary:
  select summary.*, framework.name
  from performance_alert_summary as summary
  inner join performance_framework as framework on (summary.framework_id = framework.id)
  where push_id = 671012
  -- 25,523 2020-04-01 13:42 4 1,021,214 1 77 1 671,002 671,012 1 2020-04-01 13:42 2020-04-01 13:45 2020-04-01 13:45 2,215 talos
- Using the push_id and datum with the alert signature id, we can get the job_id, taskId, and machine for the list of tasks that potentially map to the alert:
  select job.id, md.task_id, machine.name
  from job as job
  inner join taskcluster_metadata as md on (md.job_id = job.id)
  inner join machine as machine on (job.machine_id = machine.id)
  where job.id in (
    select job_id from performance_datum
    where job_id in (select id from job where push_id = 671012)
      and signature_id = 2056486)
  -- 295,012,321 Zohdn1VTQ_yqQWwGJLQNCQ t-mojave-r7-451
  -- 295,705,139 bPoHH39FS_yCUV9a7qodIA t-mojave-r7-317
  -- 295,705,131 JTyTWPRVQtiSuSVudGtHCA t-mojave-r7-104
- So the alert should correlate to one of those tasks, but I don't have a programmatic way to select the correct one.
- On the BigQuery side I see the linking to the perfherder data by taskId, framework, and machine name (which would have to be added to the BQ schema). With the addition of the extra raptor framework test variations (cpu, memory, power) we will most likely want to make suite part of the linking too. Another issue is runId, which I don't see a way to disambiguate on the Treeherder side and which could lead to an incorrect mapping.
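For what it is worth, the three steps above can be chained into a single query. It does not resolve the ambiguity, it just returns all candidate tasks in one pass; a sketch only, with the same assumed column names (e.g. series_signature_id) as the queries above:

```sql
-- Sketch: all candidate tasks for alert 92312 in one query.
-- Several jobs can share the push and signature, so this still
-- returns a set of tasks rather than a single match.
SELECT job.id, md.task_id, machine.name
FROM performance_alert AS alert
INNER JOIN performance_alert_summary AS summary ON alert.summary_id = summary.id
INNER JOIN performance_datum AS datum ON datum.signature_id = alert.series_signature_id
INNER JOIN job ON job.id = datum.job_id AND job.push_id = summary.push_id
INNER JOIN taskcluster_metadata AS md ON md.job_id = job.id
INNER JOIN machine ON machine.id = job.machine_id
WHERE alert.id = 92312;
```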
Comment 3•6 years ago
:jmaher Do you have insight on :trink's comment above? I am not familiar enough yet with the process to provide any help.
Comment 4•6 years ago
I see the problem, but the practice and design are both built around an alert being on a given push, despite having multiple jobs (task_ids) that map to a specific taskgroup/push. As there is so much noise in the majority of the tests, collecting a series of data points shows a range of values for each push, and if that range has shifted, then we have an alert.
So if we can import an alert and have it match a push_id (or branch/revision), then it doesn't have to match a single task_id, but rather a series of task_ids. If we assume that, can we make the logical connection?
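Under that one-alert-to-many-tasks model, the task set per alert could be collected directly; a sketch (assumed column names, MySQL GROUP_CONCAT):

```sql
-- Sketch: one row per alert with the set of candidate task ids on its push.
SELECT alert.id AS alert_id,
       summary.push_id,
       GROUP_CONCAT(md.task_id) AS task_ids
FROM performance_alert AS alert
INNER JOIN performance_alert_summary AS summary ON alert.summary_id = summary.id
INNER JOIN job ON job.push_id = summary.push_id
INNER JOIN taskcluster_metadata AS md ON md.job_id = job.id
GROUP BY alert.id, summary.push_id;
```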
Comment 5•6 years ago
The first pass is ready for review/experimentation in moz-fx-data-taskclu-prod-8fbf.taskclusteretl.perfherder_alert (this table is subject to change based on feedback and will only be considered finalized when the bug is closed).
Notes:
- The data was backfilled to Feb 1 (the date the last major revision to the perfherder data loader was deployed)
- The data is updated once a day (a minimum of one hour after UTC midnight, but before any scheduled queries run). There is no reason it cannot be more frequent, but generally it is a very small amount of sporadically available data.
- The alerts are extracted by their last_updated time and appended to the table (basically it is a daily snapshot containing all revisions)
- The data is joined on the push revision as that is the only revision identifier the perfherder table has (unlike the query above that uses the commit table)
- The alert dates don't always correspond to the actual task run date, so it is better to right join on the alerts to prevent a large scan of the perfherder data, since you won't actually know which date partitions are needed.
Tracking iterative changes here:
- Added extra_options for additional disambiguation
- alert.revision was renamed alert.push_revision
- The schema was updated with required field specifications that match the MySQL definition
Here is a simple example query:
SELECT
DISTINCT taskId,
perf.framework,
perf.revision,
perf.platform,
perf.symbol,
perf.groupSymbol,
alert.suite,
alert.status,
alert.notes,
alert.bug_number,
alert.extra_options
FROM
taskclusteretl.perfherder AS perf
LEFT JOIN
taskclusteretl.perfherder_alert AS alert
ON
perf.revision = alert.push_revision
AND perf.framework = alert.framework
AND perf.platform = alert.platform
WHERE
EXISTS (
SELECT
s.name,
s.extraOptions
FROM
UNNEST(perf.suites) AS s
WHERE
s.name = alert.suite
AND alert.extra_options = ARRAY_TO_STRING(s.extraOptions, " "))
AND time >= "2020-04-07"
AND time < "2020-04-08"
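Following the note above about joining on the alert side, the same lookup could instead be driven from the (much smaller) perfherder_alert table; a sketch only, reusing the column names from the example query, and still bounded by the time predicate on the perfherder side:

```sql
-- Sketch: drive the join from perfherder_alert so every alert row is kept;
-- perfherder columns are NULL where no matching entry falls in the window.
SELECT DISTINCT taskId,
  perf.revision,
  alert.suite,
  alert.status,
  alert.bug_number
FROM
  taskclusteretl.perfherder AS perf
RIGHT JOIN
  taskclusteretl.perfherder_alert AS alert
ON
  perf.revision = alert.push_revision
  AND perf.framework = alert.framework
  AND perf.platform = alert.platform
  AND perf.time >= "2020-04-07"
  AND perf.time < "2020-04-08"
```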
Comment 6•6 years ago
Currently the data being pulled is the original summary (summary_id), so I will change it to pull the reassigned summary (related_summary_id), like the query in the initial report. Is there any case where you would be interested in the original summary information too? (Sometimes the snapshot will capture it, but it is not guaranteed at this point.)
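A common way to prefer the reassigned summary while retaining the original is a COALESCE over the two columns; a sketch, assuming the MySQL columns are named summary_id and related_summary_id as described above:

```sql
-- Sketch: use the reassigned summary when present, else the original,
-- keeping the original id alongside for history.
SELECT alert.id,
       COALESCE(alert.related_summary_id, alert.summary_id) AS effective_summary_id,
       alert.summary_id AS original_summary_id
FROM performance_alert AS alert;
```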
Comment 7•6 years ago
For consistency I made sure the original summary is included in the history.
The query/code can be reviewed here:
https://github.com/mozilla-services/lua_sandbox_extensions/pull/513/files#diff-28223613da61d722f3f2caa0bbd5766fR56
Comment 8•6 years ago
This has been deployed to production; if any changes are required, just create a new bug with the specifics.
Comment 9•6 years ago
I've been downloading and using the data here this past week, and it looks good to me. The only issues I had were null recording dates and confusion over the meaning and nullability of replicates, base_replicates, and ref_replicates. The first looks fixed in bug 1630060, and we can file a bug for the second if it becomes important. Thanks!
Comment 10•6 years ago
The 'second' is data in the perfherder output (i.e. not from a source that has to be merged in), so it is completely under the control of the performance framework and not a function of the ETL process.
Updated•3 years ago