Add Perfherder to sheriff bug information to BigQuery
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: ccd, Assigned: trink)
References
Details
In the Treeherder MySQL instance there is information relating to regression alerts and the corresponding sheriff activity. This information is missing from the moz-fx-data-derived-datasets.taskclusteretl dataset and its perfherder table. The request is to add this information to the BigQuery dataset.
This query shows a use case.
The missing MySQL tables are:
- performance_alert
- performance_alert_summary
Updated•6 years ago
Comment 1•6 years ago
This query would also need:
- performance_signature
- performance_framework
- repository
- machine_platform
For untangling performance_signature, these fields exist:
- option_collection_id, which would need the option_collection table and the option table
For performance_alert_summary, these fields exist:
- assignee_id - not used, based on a code search
- issue_tracker_id - needs the issue_tracker database
- prev_push_id / push_id - needs the table: push
For repository, we need:
- repository_group_id - uses table: repository_group
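The foreign-key chain above could be collapsed into a single join. This is only a sketch: the column names (e.g. series_signature_id, platform_id) are assumptions based on Treeherder's usual naming conventions, not verified against the live schema.

```sql
-- Sketch only: assumed Treeherder foreign-key column names.
SELECT alert.*,
       summary.push_id,
       sig.suite,
       framework.name AS framework,
       repo.name      AS repository,
       platform.platform
FROM performance_alert AS alert
INNER JOIN performance_alert_summary AS summary ON alert.summary_id = summary.id
INNER JOIN performance_signature AS sig ON alert.series_signature_id = sig.id
INNER JOIN performance_framework AS framework ON summary.framework_id = framework.id
INNER JOIN repository AS repo ON summary.repository_id = repo.id
INNER JOIN machine_platform AS platform ON sig.platform_id = platform.id;
```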
I am not sure how the data is stored now or how we will relate performance_datum -> performance_signature & performance_alert.
:trink, is what I am outlining here useful? Can you comment on what you think we need to support these additional tables and their related data?
Comment 2•6 years ago
Not all the data requested has to be extracted from Treeherder; BigQuery already has most of the requested metadata. What is really needed is a way to correlate the alert back to a specific task and down to a specific perfherder entry within that task. From what I am seeing, the Treeherder schema was not designed for this type of mapping; with a series of queries I can get close. I am hoping I am missing something and we can close the gap. I will use alert id 92312 to illustrate the issue:
- Get the alert:
  select * from performance_alert where id = 92312
  -- 92,312 0 4 4.13 -18.34 443.78 425.44 1,000.00 2,056,486 25,523 1 2,215 0 2020-04-01 13:42 2020-04-01 13:42 2020-04-01 13:45
- Get the alert summary:
  select summary.*, framework.name
  from performance_alert_summary as summary
  inner join performance_framework as framework on (summary.framework_id = framework.id)
  where push_id = 671012
  -- 25,523 2020-04-01 13:42 4 1,021,214 1 77 1 671,002 671,012 1 2020-04-01 13:42 2020-04-01 13:45 2020-04-01 13:45 2,215 talos
- Using the push_id and datum with the alert signature id, we can get the job_id, taskId, and machine for the list of tasks that potentially map to the alert:
  select job.id, md.task_id, machine.name
  from job as job
  inner join taskcluster_metadata as md on (md.job_id = job.id)
  inner join machine as machine on (job.machine_id = machine.id)
  where job.id in (
    select job_id from performance_datum
    where job_id in (select id from job where push_id = 671012)
      and signature_id = 2056486)
  -- 295,012,321 Zohdn1VTQ_yqQWwGJLQNCQ t-mojave-r7-451
  -- 295,705,139 bPoHH39FS_yCUV9a7qodIA t-mojave-r7-317
  -- 295,705,131 JTyTWPRVQtiSuSVudGtHCA t-mojave-r7-104
- So the alert should correlate to one of those tasks, but I don't have a programmatic way to select the correct one.
- On the BigQuery side I see the linking to the perfherder data by taskId, framework, and machine name (which would have to be added to the BQ schema). With the addition of the extra raptor framework test variations (cpu, memory, power) we will most likely want to make suite part of the linking too. Another issue is runId, which I don't see a way to disambiguate on the Treeherder side and which could lead to an incorrect mapping.
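For what it is worth, the three steps above can be chained into a single query. It does not resolve the ambiguity, it just returns all candidate tasks in one pass; a sketch only, with the same assumed column names (e.g. series_signature_id) as the queries above:

```sql
-- Sketch: all candidate tasks for alert 92312 in one query.
-- Several jobs can share the push and signature, so this still
-- returns a set of tasks rather than a single match.
SELECT job.id, md.task_id, machine.name
FROM performance_alert AS alert
INNER JOIN performance_alert_summary AS summary ON alert.summary_id = summary.id
INNER JOIN performance_datum AS datum ON datum.signature_id = alert.series_signature_id
INNER JOIN job ON job.id = datum.job_id AND job.push_id = summary.push_id
INNER JOIN taskcluster_metadata AS md ON md.job_id = job.id
INNER JOIN machine ON machine.id = job.machine_id
WHERE alert.id = 92312;
```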
Comment 3•6 years ago
:jmaher Do you have insight on :trink's comment above? I am not familiar enough yet with the process to provide any help.
Comment 4•6 years ago
I see the problem, but the practice and design are both built around an alert being on a given push, despite having multiple jobs (task_ids) that map to a specific taskgroup/push. As there is so much noise in the majority of the tests, collecting a series of data points shows a range of values for each push, and if that range has shifted, then we have an alert.
So if we can import an alert and have it match a push_id (or branch/revision), then it doesn't have to match a single task_id, but rather a series of task_ids. If we assume that, can we make the logical connection?
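Under that one-alert-to-many-tasks model, the task set per alert could be collected directly; a sketch (assumed column names, MySQL GROUP_CONCAT):

```sql
-- Sketch: one row per alert with the set of candidate task ids on its push.
SELECT alert.id AS alert_id,
       summary.push_id,
       GROUP_CONCAT(md.task_id) AS task_ids
FROM performance_alert AS alert
INNER JOIN performance_alert_summary AS summary ON alert.summary_id = summary.id
INNER JOIN job ON job.push_id = summary.push_id
INNER JOIN taskcluster_metadata AS md ON md.job_id = job.id
GROUP BY alert.id, summary.push_id;
```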
Comment 5•6 years ago
The first pass is ready for review/experimentation in moz-fx-data-taskclu-prod-8fbf.taskclusteretl.perfherder_alert (this table is subject to change based on feedback and will only be considered finalized when the bug is closed).
Notes:
- The data was backfilled to Feb 1 (the date the last major revision to the perfherder data loader was deployed)
- The data is updated once a day (a minimum of one hour after UTC midnight, but before any scheduled queries run). There is no reason it cannot be more frequent, but generally it is a very small amount of sporadically available data.
- The alerts are extracted by their last_updated time and appended to the table (basically it is a daily snapshot containing all revisions)
- The data is joined on the push revision as that is the only revision identifier the perfherder table has (unlike the query above that uses the commit table)
- The alert dates don't always correspond to the actual task run date, so it is better to right join on the alerts to prevent a large scan of the perfherder data, since you won't actually know which date partitions are needed.
Tracking iterative changes here:
- Added extra_options for additional disambiguation
- alert.revision was renamed alert.push_revision
- The schema was updated with required field specifications that match the MySQL definition
Here is a simple example query:
SELECT
DISTINCT taskId,
perf.framework,
perf.revision,
perf.platform,
perf.symbol,
perf.groupSymbol,
alert.suite,
alert.status,
alert.notes,
alert.bug_number,
alert.extra_options
FROM
taskclusteretl.perfherder AS perf
LEFT JOIN
taskclusteretl.perfherder_alert AS alert
ON
perf.revision = alert.push_revision
AND perf.framework = alert.framework
AND perf.platform = alert.platform
WHERE
EXISTS (
SELECT
s.name,
s.extraOptions
FROM
UNNEST(perf.suites) AS s
WHERE
s.name = alert.suite
AND alert.extra_options = ARRAY_TO_STRING(s.extraOptions, " "))
AND time >= "2020-04-07"
AND time < "2020-04-08"
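Following the note above about joining on the alert side, the same lookup could instead be driven from the (much smaller) perfherder_alert table; a sketch only, reusing the column names from the example query, and still bounded by the time predicate on the perfherder side:

```sql
-- Sketch: drive the join from perfherder_alert so every alert row is kept;
-- perfherder columns are NULL where no matching entry falls in the window.
SELECT DISTINCT taskId,
  perf.revision,
  alert.suite,
  alert.status,
  alert.bug_number
FROM
  taskclusteretl.perfherder AS perf
RIGHT JOIN
  taskclusteretl.perfherder_alert AS alert
ON
  perf.revision = alert.push_revision
  AND perf.framework = alert.framework
  AND perf.platform = alert.platform
  AND perf.time >= "2020-04-07"
  AND perf.time < "2020-04-08"
```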
Comment 6•6 years ago
Currently the data being pulled is the original summary (summary_id), so I will change it to pull the reassigned summary (related_summary_id), like the query in the initial report. Is there any case where you would be interested in the original summary information too? (Sometimes the snapshot will capture it, but it is not guaranteed at this point.)
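A common way to prefer the reassigned summary while retaining the original is a COALESCE over the two columns; a sketch, assuming the MySQL columns are named summary_id and related_summary_id as described above:

```sql
-- Sketch: use the reassigned summary when present, else the original,
-- keeping the original id alongside for history.
SELECT alert.id,
       COALESCE(alert.related_summary_id, alert.summary_id) AS effective_summary_id,
       alert.summary_id AS original_summary_id
FROM performance_alert AS alert;
```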
Comment 7•6 years ago
For consistency I made sure the original summary is included in the history.
The query/code can be reviewed here:
https://github.com/mozilla-services/lua_sandbox_extensions/pull/513/files#diff-28223613da61d722f3f2caa0bbd5766fR56
Comment 8•6 years ago
This has been deployed to production; if any changes are required, just create a new bug with the specifics.
Comment 9•6 years ago
I've been downloading and using the data here this past week, and it looks good to me. The only issues I had were null recording dates and confusion over the meaning and nullability of replicates, base_replicates, and ref_replicates. The first looks fixed in bug 1630060, and we can file a bug for the second if it becomes important. Thanks!
Comment 10•6 years ago
The 'second' is data in the perfherder output (i.e. not from a source that has to be merged in), so it is completely under the control of the performance framework and not a function of the ETL process.
Updated•3 years ago