Export Treeherder job snapshots to S3

RESOLVED FIXED

Status

Product: Testing
Component: ActiveData
Status: RESOLVED FIXED
Opened: 10 months ago
Updated: 2 months ago

People

(Reporter: ekyle, Assigned: ekyle)

Tracking

(Depends on: 1 bug, Blocks: 7 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

10 months ago
Pulling all of the TH data via the API proved difficult [1]. This means we must build something to either pull or push information out of Treeherder. It might be most efficient for TH to push the data out to S3, so that it can control its own resource usage. I imagine a process that aggregates the stream of snapshots into batches of perhaps 1000 and writes them to an S3 bucket for other applications to ingest.
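As a rough illustration of the batching idea only (not the eventual implementation), the sketch below groups a stream of records into batches and serializes each batch as gzip-compressed, newline-delimited JSON. The key layout `jobs/NNNNNNNN.json.gz`, the function names, and the `put_object(key, body)` callable are all assumptions made for the example; in practice `put_object` could wrap an S3 client call.

```python
import gzip
import io
import json
from itertools import islice


def batches(records, size=1000):
    """Yield lists of up to `size` records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def encode_batch(chunk):
    """Serialize one batch as gzip-compressed, newline-delimited JSON."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for record in chunk:
            gz.write(json.dumps(record, sort_keys=True).encode("utf-8"))
            gz.write(b"\n")
    return buffer.getvalue()


def export(records, put_object, size=1000):
    """Write each batch via put_object(key, body); return the keys written.

    `put_object` is a hypothetical callable supplied by the caller, and the
    key naming scheme below is an assumption for illustration only.
    """
    keys = []
    for number, chunk in enumerate(batches(records, size)):
        key = "jobs/%08d.json.gz" % number
        put_object(key, encode_batch(chunk))
        keys.append(key)
    return keys
```

Passing a dictionary's `__setitem__` as `put_object` is enough to try this locally without touching S3.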

This backup can be used by ActiveData, Telemetry, and any other scripts that want a large chunk of data fast.  It may also serve as a long-term backup for anthropological reasons.

I will attach bugs that list the specific benefits.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1289830
++ on this idea, it's been something I've been thinking of lately.

We could also use this for populating a development instance with treeherder data.
(Assignee)

Updated

10 months ago
Blocks: 1296637
(Assignee)

Updated

10 months ago
Blocks: 1296643
(Assignee)

Updated

10 months ago
Blocks: 1296650
(Assignee)

Updated

10 months ago
Blocks: 1296653
(Assignee)

Updated

10 months ago
Blocks: 1296671
(Assignee)

Updated

10 months ago
Blocks: 1296673
(Assignee)

Updated

10 months ago
Blocks: 1296686
(Assignee)

Updated

10 months ago
Blocks: 1296710
(Assignee)

Comment 2

10 months ago
I hoe
(Assignee)

Comment 3

10 months ago
I added camd and emorley with the hope they (and wlachance) can share ideas about how best this can be done.  I hope to avoid using Django, or any Python, so we minimize the resources this task requires.

1) Ideally, a built-in MySQL feature, like replication, would be best, but I fear such features are too general and would pick up security tables and other operations-specific data.
2) My next best idea is an extract query.  It too benefits from avoiding Python, but it must be maintained over database migrations.  I will attach a fun query that produces JSON, unpacks the artifacts, and includes "everything".  Not that it is meant to be run.
3) We could dynamically generate the required query by navigating the database relations, with a config file stating what NOT to include.  This will be more robust over database migrations, and can be used on other databases.  It would be a stand-alone tool to point at a database to get a streaming backup.  It may also work better remotely:  I assume the native MySQL transmission format would be denser than JSON.
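
To make option 3 a little more concrete, here is a minimal sketch of generating a query by walking foreign-key relations while honouring an exclusion list. Everything in it is an assumption for illustration: the `FOREIGN_KEYS` map and its table names are invented, every referenced table is assumed to have an `id` primary key, and only outgoing (many-to-one) references are followed. A real tool would more likely read the relations from MySQL's `information_schema.KEY_COLUMN_USAGE`.

```python
from collections import deque

# Hypothetical foreign-key map: table -> [(column, referenced_table), ...].
FOREIGN_KEYS = {
    "job": [
        ("build_platform_id", "build_platform"),
        ("machine_id", "machine"),
        ("user_id", "auth_user"),
    ],
    "build_platform": [],
    "machine": [],
    "auth_user": [],
}


def plan_joins(fact_table, foreign_keys, exclude):
    """Breadth-first walk of foreign keys, skipping excluded tables.

    Simplification: each table is joined at most once, so a table that is
    referenced through two different columns only appears for the first.
    """
    joins = []
    seen = {fact_table}
    queue = deque([fact_table])
    while queue:
        table = queue.popleft()
        for column, target in foreign_keys.get(table, []):
            if target in exclude or target in seen:
                continue
            seen.add(target)
            joins.append((table, column, target))
            queue.append(target)
    return joins


def build_query(fact_table, foreign_keys, exclude=()):
    """Return a SELECT with one LEFT JOIN per reachable, non-excluded table."""
    lines = ["SELECT *", "FROM `%s`" % fact_table]
    for table, column, target in plan_joins(fact_table, foreign_keys, set(exclude)):
        # Assumes the referenced primary key is named `id`.
        lines.append(
            "LEFT JOIN `%s` ON `%s`.`%s` = `%s`.`id`" % (target, table, column, target)
        )
    return "\n".join(lines)


print(build_query("job", FOREIGN_KEYS, exclude=["auth_user"]))
```

Because the joins are derived from the relation map rather than written by hand, a schema migration would change the generated query without anyone editing SQL; only the exclusion list needs maintaining.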
(Assignee)

Comment 4

10 months ago
Created attachment 8783010 [details]
all_treeherder.sql

Queries like this show SQL is just another language, like Python, but awesomer.

Comment 5

5 months ago
I see this as blocking code coverage to some extent. We now have the redash interface; I do wonder if that helps this bug, invalidates this bug, or isn't related.
(In reply to Joel Maher ( :jmaher) from comment #5)
> I see this as blocking code coverage to some extent. We now have the redash
> interface; I do wonder if that helps this bug, invalidates this bug, or
> isn't related.

Not sure what this has to do with code coverage. We may still want to do this, though I think it's a lot less of a priority now that we have the read-only replica (and redash).

Comment 7

5 months ago
I think the only use for this bug now is for people trying to bootstrap a local Treeherder development environment. (And there may be a better solution for this than dumping to S3).

IMO all other use cases are better served (or at least _could_ be served) by pointing people at redash or giving them access to the read-only Treeherder replica.
(Assignee)

Comment 8

5 months ago
I am moving this bug to ActiveData.  I am still working on it.
Assignee: nobody → klahnakoski
Component: Treeherder: API → ActiveData
Product: Tree Management → Testing
Version: --- → unspecified
(Assignee)

Comment 9

4 months ago
Created attachment 8837139 [details]
690.103.106.json


This is now running every 10 minutes to extract job records (and related data). The process is currently extracting 30+ jobs/second, and should catch up in about a week.

Ed, your feedback is valuable; you are good at seeing perspectives I have a tendency to miss. I hope you have time to get a high-level understanding of what I have done and tell me your concerns. Thank you.

Code
https://github.com/klahnakoski/MySQL-to-S3

Config
https://github.com/klahnakoski/MySQL-to-S3/blob/master/resources/config/treeherder.json

Example file
https://s3-us-west-2.amazonaws.com/active-data-treeherder-jobs/690.102.json.gz

Example document (attached)
Attachment #8837139 - Flags: feedback?(emorley)
(Assignee)

Comment 10

4 months ago
The planned schema change will test the dynamic schema exploration.

https://bugzilla.mozilla.org/show_bug.cgi?id=1323110

Comment 11

4 months ago
Comment on attachment 8837139 [details]
690.103.106.json

I'm not sure exactly what will be using these, so I'm struggling to think of feedback.

My only thought is: how will this handle, e.g., tables/fields being renamed/added/dropped (in terms of ingestion and also output; e.g., will a consumer need to expect multiple formats of JSON output depending on age)?
Attachment #8837139 - Flags: feedback?(emorley)
(Assignee)

Comment 12

4 months ago
Ed, thank you for your time!

The use cases for this data are listed in the dependent bugs.

The overall process scans the database for schema changes, and ensures those changes are reflected in the output JSON. You rightly concluded the JSON will change format over time as the database schema changes; this is unavoidable.
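
For consumers, one way to cope with the changing format is to read fields defensively instead of assuming a fixed layout. The sketch below is only an illustration of that idea: the helper names and the two example documents (an "old" flat field and a "new" nested one) are invented and do not reflect the actual output.

```python
def get_path(document, path, default=None):
    """Return the value at a dotted path, or `default` if any step is missing."""
    value = document
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value


def first_present(document, paths, default=None):
    """Try several dotted paths in order; handy when a field was renamed."""
    for path in paths:
        value = get_path(document, path)
        if value is not None:
            return value
    return default


# Hypothetical documents written before and after a schema change.
old_doc = {"job": {"id": 1, "machine_name": "m1"}}
new_doc = {"job": {"id": 2, "machine": {"name": "m2"}}}

for doc in (old_doc, new_doc):
    name = first_present(doc, ["job.machine.name", "job.machine_name"], "unknown")
    print(doc["job"]["id"], name)
```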
(Assignee)

Comment 13

3 months ago
This is done.  There appear to be a few holes in the ES data, but I am sure they can be backfilled.

https://activedata.allizom.org/tools/query.html#query_id=IBtfSDPN
https://activedata.allizom.org/tools/query.html#query_id=PrsJ1cw+
(Assignee)

Updated

3 months ago
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → FIXED
(Assignee)

Updated

2 months ago
Depends on: 1361362