Review jobs in BigQuery
Categories
(Tree Management :: Treeherder: Infrastructure, task)
Tracking
(Not tracked)
People
(Reporter: ekyle, Assigned: aryx)
References
Details
The jobs from staging are in BigQuery. They must be reviewed.
Updated • 4 years ago
Comment 1 • 4 years ago (Reporter)
The data is available through the jobs view, but browsing the dataset will also show the source tables.
There are a lot of duplicates, and there will always be duplicates:
- There are short periods when two versions of the same script are running; adding the same data
- As a job evolves it is extracted again to get the later data
- Other duplication exists to ensure all jobs are caught, even if that means capturing some more than once.
To get a clean version of the data, use this subquery:
SELECT * EXCEPT (_rank)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY submit__time) AS _rank
  FROM treeherder_2d_stage.jobs
) a
WHERE _rank = 1  -- keep only the first record per job id
LIMIT 10
(remove the LIMIT clause if you are serious)
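For illustration, the same keep-one-record-per-id logic can be sketched in plain Python (a hedged sketch, not the extraction code; the id and submit__time field names follow the query above, and the sample rows are invented):

```python
def dedup_jobs(rows):
    """Keep only the earliest record per job id, mirroring the
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY submit__time) = 1 filter."""
    best = {}
    for row in rows:
        jid = row["id"]
        if jid not in best or row["submit__time"] < best[jid]["submit__time"]:
            best[jid] = row
    return list(best.values())

# Hypothetical sample rows showing the duplication patterns described above
rows = [
    {"id": 1, "submit__time": 100, "state": "running"},
    {"id": 1, "submit__time": 100, "state": "running"},   # two scripts adding the same data
    {"id": 1, "submit__time": 90,  "state": "pending"},   # earlier extract of an evolving job
    {"id": 2, "submit__time": 50,  "state": "completed"},
]
clean = dedup_jobs(rows)
```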
Comment 2 • 4 years ago (Reporter)
Other notes
- The extract runs once per hour, which was fine for the initial extract (which took 5 days); it can be changed to every 10 minutes when we want.
- The BigQuery API is limited to alphanumeric names: underscore is escaped as double underscore (__), and all other characters are escaped as _<utf8 hex code>_.
- Datatypes are included in the names, which allows the schema to expand over time. https://github.com/klahnakoski/jx-bigquery/blob/3a704ccaed7de7e1014392c82e4d0b863e59aefb/jx_bigquery/typed_encoder.py#L158 The double-underscore escaping applies, so __n__ is numeric.
- Arrays (nested) are given the name __a__, which allows us to encode more than just arrays of objects. We can encode arrays of arrays (__a__.__a__) or arrays of numbers (__a__.__n__). These are not realized in this schema at this time, but others will have them.
- We can remove these type indicators: if we do it in the extraction code, we will need a protocol for what to do when the Treeherder datatypes change. The type indicators can also be removed with a view, which may be easier.
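A minimal sketch of the name-escaping scheme described above (my reconstruction from this description, not the authoritative code, which lives in jx_bigquery/typed_encoder.py):

```python
def escape_name(name: str) -> str:
    """Escape a column name for BigQuery's alphanumeric-only rule:
    '_' becomes '__', and any other non-alphanumeric character becomes
    '_<utf8 hex>_' (e.g. '-' becomes '_2d_')."""
    out = []
    for ch in name:
        if ch == "_":
            out.append("__")
        elif ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("_" + ch.encode("utf8").hex() + "_")
    return "".join(out)
```

This is consistent with the names visible in the thread: submit_time becomes submit__time, and a hyphenated dataset name like treeherder-stage becomes treeherder_2d_stage.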
Comment 3 • 4 years ago (Reporter)
Reminder: this is STAGE data!
Comment 4 • 4 years ago
I have done a few queries already to look for intermittents and regressions. Once Kyle explained the need for unnest() when there is an __a__, I was able to play with the data. I was able to match it up with the intermittent failure view, to look at bugs and specific instances. Likewise, I took a look at some fixed_by_commit data and that matched up as well.
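For readers unfamiliar with the typed nesting, here is roughly what a row with a repeated __a__ field looks like and the Python analogue of the UNNEST() step (a sketch under assumptions: the failure__lines field name, the sample values, and the __s__ string indicator are all hypothetical, chosen by analogy with __n__ above):

```python
# A job row with a typed, repeated field: the array elements live
# under "__a__", so they must be unnested before filtering.
job = {
    "id": 123,
    "failure__lines": {"__a__": [
        {"line": {"__s__": "TEST-UNEXPECTED-FAIL | a.js"}},
        {"line": {"__s__": "leakcheck | leaked"}},
    ]},
}

# Python analogue of:
#   SELECT f.line.__s__ FROM jobs, UNNEST(failure__lines.__a__) AS f
lines = [f["line"]["__s__"] for f in job["failure__lines"]["__a__"]]
```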
Comment 5 • 4 years ago (Assignee)
Thanks for implementing this. I checked some queries based on pushes and job symbols, and the values are similar (didn't filter with comment 1). Queries against failure lines take one second compared to 2 minutes in the best case with Redash (more realistically: 3 minutes -> timeout -> new run which completes after 2 minutes = 5 minutes).
Comment 6 • 4 years ago (Assignee)
- Is anything further needed or can we close this bug?
- Will the commit table data from Redash get added to BQ? push.revision contains only the latest revision. If a push e.g. contains two commits for backouts, the older one won't be available at the moment.
Comment 7 • 4 years ago (Reporter)
:aryx
This bug can be closed when you find no egregious problems with the structure of the data.
On the subject of the commit table, I will add the full set of commits for every push.
Comment 8 • 4 years ago (Assignee)
How are the job ids populated? Will Treeherder's production db and BQ eventually have the same ones? That would be necessary to say "Here, we found these jobs and these are their Treeherder URLs; look there for further information."
Comment 9 • 4 years ago (Reporter)
Yes, we are storing the job.id from the Treeherder database, so we will be able to construct the URL to open the same job in Treeherder.
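A hedged sketch of how such a URL could be built from the stored job.id (the selectedJob query-parameter format is my assumption about the Treeherder frontend, not confirmed in this bug; verify against the current UI routes):

```python
def treeherder_job_url(repo: str, job_id: int) -> str:
    # Assumed URL pattern for linking directly to a job in the
    # Treeherder UI; the parameter name is an assumption.
    return f"https://treeherder.mozilla.org/jobs?repo={repo}&selectedJob={job_id}"
```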
The extraction query looks like:
SELECT
`t1`.`id` AS `c0`, `t1`.`guid` AS `c1`, ...
FROM
(SELECT <some list of recently changed job ids>) AS `t0`
LEFT JOIN
`test_treeherder`.`job` AS `t1` ON `t1`.`id` = `t0`.`id`
...
https://github.com/mozilla/treeherder/blob/master/tests/extract/test_extract_job.sql
This query uses the sorted extraction pattern to pull all the related records in a single call.
The jobs are currently updated once per hour, so BigQuery will be a bit behind Treeherder. As of today, the extract to BigQuery is still catching up, so the data is still scattered among multiple tables waiting for a final merge.
Comment 10 • 4 years ago (Reporter)
... the PRODUCTION extract to BigQuery is still catching up,
Comment 11 • 4 years ago (Reporter)
It looks like the jobs merge was ready today, but it is failing:
"Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex."
I am fixing that.
Comment 12 • 4 years ago (Assignee)
Are there plans to proceed with treeherder_2d_stage? The latest data is from March 18th.
Comment 13 • 4 years ago (Reporter)
Since my last comment, I have been working on the problem. Unfortunately, the debug cycle is about 2 weeks long, and the failures only happen in production. My test suite continues to grow as I speculate more about what's happening. I am doubting the wisdom of putting this logic in the Treeherder codebase.
Anyway, this project is alive. It will get done.
Comment 14 • 4 years ago (Reporter)
:aryx I did not read your last comment correctly. I see the ETL did stop on March 18th. I was not watching it; I am looking now.
Comment 15 • 4 years ago (Reporter)
@aryx The treeherder_2d_stage dataset has been caught up, and treeherder_2d_prod has now resumed filling (it was also stalled at March 18th, but that has also been cleared).
Comment 16 • 4 years ago (Assignee)
Thank you, the query is showing up-to-date data.
Comment 17 • 4 years ago (Assignee)
Please reopen if more needs to be tested/evaluated.