Closed Bug 1617674 Opened 4 years ago Closed 4 years ago

Review jobs in bigquery

Categories

(Tree Management :: Treeherder: Infrastructure, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ekyle, Assigned: aryx)

References

Details

Assignee: nobody → aryx.bugmail
Summary: Review jobs in bugquery → Review jobs in bigquery

The data is available through the jobs view, but browsing the dataset will also show the source tables.

There are lots of duplicates, and there will always be duplicates:

  1. There are short periods when two versions of the same extraction script are running, adding the same data
  2. As a job evolves, it is extracted again to pick up the later data
  3. The extraction deliberately errs toward duplication to ensure all jobs are caught, even if some are caught multiple times

To get a clean version of the data, use this subquery:

-- rank the copies of each job id by submit time, then keep one copy per id
SELECT * EXCEPT (_rank)
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY submit__time) AS _rank
    FROM treeherder_2d_stage.jobs
) a
WHERE _rank = 1
LIMIT 10

(remove the limit clause if you are serious)

Other notes

  • The extract runs once per hour, which was fine for the initial extract (it took 5 days), but it can be changed to every 10 minutes when we want.
  • The BigQuery API is limited to alphanumeric names: a literal underscore is escaped as a double underscore (__), and all other characters are escaped as _<utf8 hex code>_.
  • Datatypes are included in the names, which allows the schema to expand over time. https://github.com/klahnakoski/jx-bigquery/blob/3a704ccaed7de7e1014392c82e4d0b863e59aefb/jx_bigquery/typed_encoder.py#L158 The double-underscore escaping applies, so __n__ is numeric.
  • Arrays (nested) are given the name __a__, which allows us to encode more than just arrays of objects: we can encode arrays of arrays (__a__.__a__) or arrays of numbers (__a__.__n__). These are not realized in this schema at this time, but other schemas will have them.
  • We can remove these type indicators. If we do it in the extraction code, we will need a protocol for what to do when the Treeherder datatypes change; the type indicators can also be removed with a view, which may be easier (see the sketch after this list).
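
For illustration, such a view might look like the sketch below. This is a minimal sketch only: it assumes a numeric field is stored as a record with a __n__ sub-field (per the convention above), and the source table name jobs_source is hypothetical, not the actual schema.

    CREATE VIEW treeherder_2d_stage.jobs_untyped AS
    SELECT
        t.id.__n__           AS id,          -- numeric field, unwrapped from its __n__ indicator
        t.submit__time.__n__ AS submit_time  -- "submit_time" with the underscore escaping undone
    FROM treeherder_2d_stage.jobs_source AS t
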
Type: defect → task

Reminder: this is STAGE data!

I have done a few queries already to look for intermittents and regressions. Once Kyle explained the need for UNNEST() when there is an __a__ (array), I was able to play with the data. I was able to match it up with the intermittent failure view to look at bugs and specific instances. Likewise, I took a look at some fixed_by_commit data, and that matched up as well.
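
For the record, the UNNEST() pattern looks roughly like this; failure__line.__a__ is an assumed field name, used here only to illustrate the array convention:

    SELECT j.id, line
    FROM treeherder_2d_stage.jobs AS j
    CROSS JOIN UNNEST(j.failure__line.__a__) AS line  -- expand to one row per array element
    LIMIT 10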

Thanks for implementing this. I checked some queries based on pushes and job symbols, and the values are similar (didn't filter with the deduplication from comment 1). Queries against failure lines take one second, compared to 2 minutes in the best case with Redash (more realistically: 3 minutes → timeout → new run which completes after 2 minutes = 5 minutes).

  1. Is anything further needed or can we close this bug?
  2. Will the commit table data from Redash get added to BQ? push.revision contains only the latest. If a push contains, e.g., two commits for backouts, the older one won't be available at the moment.

:aryx

This bug can be closed when you find no egregious problems with the structure of the data.

On the subject of the commit table, I will add the full set of commits for every push.

How are the job ids populated? Will Treeherder's production db and BQ eventually have the same ones? That would be necessary to say "Here, we found these jobs and these are their Treeherder URLs, look there for further information."

Yes, we are storing the job.id from the treeherder database, so we will be able to construct the URL to open the same job in Treeherder.
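
For example, something like the following would do it. This is a sketch only: the selectedJob URL parameter and the hard-coded repo are assumptions about Treeherder's URL scheme, not confirmed here:

    SELECT CONCAT(
        'https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=',
        CAST(id AS STRING)   -- the job.id copied from the Treeherder database
    ) AS treeherder_url
    FROM treeherder_2d_stage.jobs
    LIMIT 10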

The extraction query looks like:

        SELECT
            `t1`.`id` AS `c0`, `t1`.`guid` AS `c1`, ...
        FROM
            (SELECT <some list of recently changed job ids>) AS `t0`
                LEFT JOIN
            `test_treeherder`.`job` AS `t1` ON `t1`.`id` = `t0`.`id`
        ...

https://github.com/mozilla/treeherder/blob/master/tests/extract/test_extract_job.sql

This query is using the sorted extraction pattern to pull all the related records in a single call.
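
A minimal sketch of that pattern, assuming the job table has an indexed last_modified column (the placeholders :last_modified and :last_id mark where the previously seen position is bound):

    -- walk the job table in a stable (last_modified, id) order,
    -- resuming from the last position the extractor has already seen
    SELECT id
    FROM test_treeherder.job
    WHERE last_modified > :last_modified
       OR (last_modified = :last_modified AND id > :last_id)
    ORDER BY last_modified, id
    LIMIT 1000

Because the order is stable, the extractor can stop at any point and resume later without missing rows (at the cost of the occasional duplicate, as noted above).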

The jobs are currently updated once per hour, so BigQuery will be a bit behind Treeherder. As of today, the extract to BigQuery is still catching up, so the data is still scattered among multiple tables waiting for a final merge.

... the PRODUCTION extract to BigQuery is still catching up,

It looks like the jobs merge is ready today, but failing:

"Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex."

I am fixing that.
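
One plausible shape of such a fix (all table names hypothetical): merge only a few of the scattered tables per statement instead of unioning all of them at once, so each query stays under the planner's limits:

    INSERT INTO treeherder_2d_stage.jobs
    SELECT * FROM treeherder_2d_stage.jobs_shard_01
    UNION ALL
    SELECT * FROM treeherder_2d_stage.jobs_shard_02  -- a handful of shards per statement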

Are there plans to proceed with treeherder_2d_stage? The latest data is from March 18th.

Flags: needinfo?(klahnakoski)

Since my last comment, I have been working on the problem. Unfortunately, the production debug cycle is about 2 weeks long, and the failures only happen in production. My test suite continues to grow as I speculate more about what's happening. I am doubting the wisdom of putting this logic in the Treeherder codebase.

Anyway, this project is alive. It will get done.

Flags: needinfo?(klahnakoski)

:aryx I did not read your last comment correctly. I see the ETL did stop on March 18th. I was not looking at it; I am looking now.

@aryx treeherder_2d_stage has been caught up, and treeherder_2d_prod has now resumed filling (it was also stalled at March 18th, but that has also been cleared).

Flags: needinfo?(aryx.bugmail)

Thank you, the query is showing up-to-date data.

Flags: needinfo?(aryx.bugmail)

Please reopen if more needs to be tested/evaluated.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED