Closed Bug 1617674 Opened 4 years ago Closed 4 years ago

Review jobs in bigquery

Categories

(Tree Management :: Treeherder: Infrastructure, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ekyle, Assigned: aryx)

References

Details

Assignee: nobody → aryx.bugmail
Summary: Review jobs in bugquery → Review jobs in bigquery

The data is available through the jobs view, but browsing the dataset will also show the source tables.

There are lots of duplicates, and there will always be duplicates:

  1. There are short periods when two versions of the same extraction script are running, adding the same data
  2. As a job evolves, it is extracted again to pick up the later data
  3. The extraction deliberately errs toward duplication to ensure all jobs are caught, even if some are caught multiple times

To get a clean version of the data, use this subquery:

-- rank the copies of each job id by submit time, then keep one copy per id
SELECT * EXCEPT (_rank)
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY submit__time) AS _rank
    FROM treeherder_2d_stage.jobs
) a
WHERE _rank = 1
LIMIT 10

(remove the limit clause if you are serious)

Other notes

  • The extract runs once per hour, which was fine for the initial extract (it took 5 days), but it can be changed to every 10 minutes when we want.
  • The BigQuery API is limited to alphanumeric names: a literal underscore is escaped as a double underscore (__), and all other characters are escaped as _<utf8 hex code>_.
  • Datatypes are included in the names, which allows the schema to expand over time. https://github.com/klahnakoski/jx-bigquery/blob/3a704ccaed7de7e1014392c82e4d0b863e59aefb/jx_bigquery/typed_encoder.py#L158 The double-underscore escaping applies, so __n__ is numeric.
  • Arrays (nested) are given the name __a__, which allows us to encode more than just arrays of objects: we can encode arrays of arrays (__a__.__a__) or arrays of numbers (__a__.__n__). These are not realized in this schema at this time, but other schemas will have them.
  • We can remove these type indicators. If we do it in the extraction code, we will need a protocol for what to do when the Treeherder datatypes change; the type indicators can also be removed with a view, which may be easier (see the sketch after this list).
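
For illustration, such a view might look like the sketch below. This is a minimal sketch only: it assumes a numeric field is stored as a record with a __n__ sub-field (per the convention above), and the source table name jobs_source is hypothetical, not the actual schema.

    CREATE VIEW treeherder_2d_stage.jobs_untyped AS
    SELECT
        t.id.__n__           AS id,          -- numeric field, unwrapped from its __n__ indicator
        t.submit__time.__n__ AS submit_time  -- "submit_time" with the underscore escaping undone
    FROM treeherder_2d_stage.jobs_source AS t
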
Type: defect → task

Reminder: this is STAGE data!

I have done a few queries already to look for intermittents and regressions. Once Kyle explained the need for UNNEST() when there is an __a__ (array), I was able to play with the data. I was able to match it up with the intermittent failure view to look at bugs and specific instances. Likewise, I took a look at some fixed_by_commit data, and that matched up as well.
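
For the record, the UNNEST() pattern looks roughly like this; failure__line.__a__ is an assumed field name, used here only to illustrate the array convention:

    SELECT j.id, line
    FROM treeherder_2d_stage.jobs AS j
    CROSS JOIN UNNEST(j.failure__line.__a__) AS line  -- expand to one row per array element
    LIMIT 10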

Thanks for implementing this. I checked some queries based on pushes and job symbols, and the values are similar (didn't filter with the deduplication from comment 1). Queries against failure lines take one second, compared to 2 minutes in the best case with Redash (more realistically: 3 minutes → timeout → new run which completes after 2 minutes = 5 minutes).

  1. Is anything further needed or can we close this bug?
  2. Will the commit table data from Redash get added to BQ? push.revision contains only the latest. If a push contains, e.g., two commits for backouts, the older one won't be available at the moment.

:aryx

This bug can be closed when you find no egregious problems with the structure of the data.

On the subject of the commit table, I will add the full set of commits for every push.

How are the job ids populated? Will Treeherder's production db and BQ eventually have the same ones? That would be necessary to say "Here, we found these jobs and these are their Treeherder URLs, look there for further information."

Yes, we are storing the job.id from the treeherder database, so we will be able to construct the URL to open the same job in Treeherder.
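
For example, something like the following would do it. This is a sketch only: the selectedJob URL parameter and the hard-coded repo are assumptions about Treeherder's URL scheme, not confirmed here:

    SELECT CONCAT(
        'https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=',
        CAST(id AS STRING)   -- the job.id copied from the Treeherder database
    ) AS treeherder_url
    FROM treeherder_2d_stage.jobs
    LIMIT 10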

The extraction query looks like:

        SELECT
            `t1`.`id` AS `c0`, `t1`.`guid` AS `c1`, ...
        FROM
            (SELECT <some list of recently changed job ids>) AS `t0`
                LEFT JOIN
            `test_treeherder`.`job` AS `t1` ON `t1`.`id` = `t0`.`id`
        ...

https://github.com/mozilla/treeherder/blob/master/tests/extract/test_extract_job.sql

This query is using the sorted extraction pattern to pull all the related records in a single call.
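
A minimal sketch of that pattern, assuming the job table has an indexed last_modified column (the placeholders :last_modified and :last_id mark where the previously seen position is bound):

    -- walk the job table in a stable (last_modified, id) order,
    -- resuming from the last position the extractor has already seen
    SELECT id
    FROM test_treeherder.job
    WHERE last_modified > :last_modified
       OR (last_modified = :last_modified AND id > :last_id)
    ORDER BY last_modified, id
    LIMIT 1000

Because the order is stable, the extractor can stop at any point and resume later without missing rows (at the cost of the occasional duplicate, as noted above).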

The jobs are currently updated once per hour, so BigQuery will be a bit behind Treeherder. As of today, the extract to BigQuery is still catching up, so the data is still scattered among multiple tables waiting for a final merge.

... the PRODUCTION extract to BigQuery is still catching up,

It looks like the jobs merge is ready today, but failing:

"Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex."

I am fixing that.
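
One plausible shape of such a fix (all table names hypothetical): merge only a few of the scattered tables per statement instead of unioning all of them at once, so each query stays under the planner's limits:

    INSERT INTO treeherder_2d_stage.jobs
    SELECT * FROM treeherder_2d_stage.jobs_shard_01
    UNION ALL
    SELECT * FROM treeherder_2d_stage.jobs_shard_02  -- a handful of shards per statement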

Are there plans to proceed with treeherder_2d_stage? The latest data is from March 18th.

Flags: needinfo?(klahnakoski)

Since my last comment, I have been working on the problem. Unfortunately, the production debug cycle is about 2 weeks long, and the failures only happen in production. My test suite continues to grow as I speculate more about what's happening. I am doubting the wisdom of putting this logic in the Treeherder codebase.

Anyway, this project is alive. It will get done.

Flags: needinfo?(klahnakoski)

:aryx I did not read your last comment correctly. I see the ETL did stop on March 18th. I was not looking at it; I am looking now.

@aryx treeherder_2d_stage has been caught up, and treeherder_2d_prod has now resumed filling (it was also stalled at March 18th, but that has also been cleared).

Flags: needinfo?(aryx.bugmail)

Thank you, the query is showing up-to-date data.

Flags: needinfo?(aryx.bugmail)

Please reopen if more needs to be tested/evaluated.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED