Pulling all of the Treeherder data via the API proved difficult, so we must build something to either pull or push the information out of Treeherder. It might be most efficient for TH to push the data out to S3, so that it can control its own resource usage. I imagine a process will be responsible for aggregating the stream of snapshots into batches of roughly 1000, and writing them to an S3 bucket for other applications to ingest. This backup can be used by ActiveData, Telemetry, and any other scripts that want a large chunk of data fast. It may also serve as a long-term backup for anthropological reasons. I will attach bugs that list the specific benefits.  https://bugzilla.mozilla.org/show_bug.cgi?id=1289830
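A minimal sketch of the batching step described above, assuming snapshots arrive as a stream of dicts. The batch size and record shape are placeholders, and the actual S3 upload (e.g. via boto3) is deliberately omitted; this only shows grouping and gzipped JSON-lines encoding.

```python
import gzip
import io
import json

BATCH_SIZE = 1000  # rough batch size suggested above; tune as needed


def make_batches(snapshots, batch_size=BATCH_SIZE):
    """Group a stream of snapshot dicts into lists of at most batch_size."""
    batch = []
    for snap in snapshots:
        batch.append(snap)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def encode_batch(batch):
    """Serialize one batch as gzipped JSON-lines, ready for an S3 PUT.

    The upload itself (bucket name, keys, credentials) is out of scope here.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in batch:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
    return buf.getvalue()
```

JSON-lines-per-batch keeps each S3 object independently ingestible, which suits consumers that want "a large chunk of data fast".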
++ on this idea, it's been something I've been thinking of lately. We could also use this for populating a development instance with treeherder data.
I added camd and emorley with the hope that they (and wlachance) can share ideas about how best this can be done. I hope to avoid using Django, or any Python, so we minimize the resources this task requires.

1) Ideally, a built-in MySQL feature would be best, like replication, but I fear those are too general, and would pick up security tables and other operations-specific data.

2) My next-best idea is an extract query: it too benefits from avoiding Python, but it must be maintained over database migrations. I will attach a fun query that produces JSON, unpacks the artifacts, and includes "everything". Not that it is meant to be run.

3) We could dynamically generate the required query by navigating the database relations, with a config file saying which tables to NOT include. This would be more robust over database migrations, and could be used on other databases. It would be a stand-alone tool to point at a database to get a streaming backup. It may also work better remotely: I assume the native MySQL transmission format would be denser than JSON.
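A hypothetical sketch of option 3's relation walk: decide which tables an extract must cover by following foreign-key edges outward from a root table, skipping tables the config excludes. The relations dict here stands in for what would actually be read from information_schema.KEY_COLUMN_USAGE; all table names are invented for illustration.

```python
# child table -> parent tables it references (placeholder for FK metadata
# that a real tool would read from information_schema.KEY_COLUMN_USAGE)
RELATIONS = {
    "job": ["push", "machine"],
    "job_log": ["job"],
    "failure_line": ["job_log"],
    "auth_user": [],
}

# security/operations tables the config file rules out (invented names)
EXCLUDE = {"auth_user"}


def reachable_tables(root, relations=RELATIONS, exclude=EXCLUDE):
    """Return every table reachable from `root` along FK edges, either direction."""
    # Build an undirected adjacency map so child tables of the root are found too.
    edges = {}
    for child, parents in relations.items():
        for parent in parents:
            edges.setdefault(child, set()).add(parent)
            edges.setdefault(parent, set()).add(child)
    seen, stack = set(), [root]
    while stack:
        table = stack.pop()
        if table in seen or table in exclude:
            continue
        seen.add(table)
        stack.extend(edges.get(table, ()))
    return seen
```

Because the walk is driven by live FK metadata rather than a hand-written query, it survives schema migrations, which is the robustness argument made above.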
Created attachment 8783010 [details]
all_treeherder.sql

Queries like this show SQL is just another language, like Python, but awesomer.
I see this as blocking code coverage to some extent. Now that we have the redash interface, I do wonder if that helps this bug, invalidates this bug, or isn't related.
(In reply to Joel Maher ( :jmaher) from comment #5)
> I see this as blocking code coverage to some extent, we now have the redash
> interface, I do wonder if that helps this bug, invalidates this bug, or
> isn't related.

Not sure what this has to do with code coverage. We may still want to do this, though I think it's a lot less of a priority now that we have the read-only replica (and redash).
I think the only use for this bug now is for people trying to bootstrap a local Treeherder development environment. (And there may be a better solution for this than dumping to S3). IMO all other use cases are better served (or at least _could_ be served) by pointing people at redash or giving them access to the read-only Treeherder replica.
I am moving this to ActiveData. I am still working on it.
Created attachment 8837139 [details]
690.103.106.json

This is now running every 10 minutes to extract job records (and related data). The process is currently extracting 30+ jobs/second, and should catch up in about a week.

Ed, your feedback is valuable; you are good at seeing perspectives I have a tendency to miss. I hope you have time to get a high-level understanding of what I have done and tell me your concerns. Thank you.

Code: https://github.com/klahnakoski/MySQL-to-S3
Config: https://github.com/klahnakoski/MySQL-to-S3/blob/master/resources/config/treeherder.json
Example file: https://s3-us-west-2.amazonaws.com/active-data-treeherder-jobs/690.102.json.gz
Example document: (attached)
The planned schema change will test the dynamic schema exploration. https://bugzilla.mozilla.org/show_bug.cgi?id=1323110
Comment on attachment 8837139 [details]
690.103.106.json

I'm not sure exactly what will be consuming these files, so I'm struggling to think of feedback. My only thought is: how will this handle, e.g., tables/fields being renamed/added/dropped, both in terms of ingestion and of output? E.g. will a consumer need to expect multiple formats of JSON output depending on age?
Ed, thank you for your time! The use cases for this data are listed in the dependent bugs. The overall process scans the database for schema changes, and ensures those changes are reflected in the output JSON. You rightly concluded that the JSON will change format over time as the database schema changes; this is unavoidable.
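A minimal sketch of the schema scan described above, assuming each scan produces a snapshot of the form {table: [column, ...]} (as could be read from information_schema.COLUMNS). Comparing two snapshots tells the extractor what to add to or drop from the output JSON; the table and column names below are invented.

```python
def schema_diff(old, new):
    """Compare two {table: [columns]} snapshots; report what changed."""
    diff = {
        "added_tables": sorted(set(new) - set(old)),
        "dropped_tables": sorted(set(old) - set(new)),
        "columns": {},
    }
    # For tables present in both snapshots, report column-level changes.
    for table in set(old) & set(new):
        added = sorted(set(new[table]) - set(old[table]))
        dropped = sorted(set(old[table]) - set(new[table]))
        if added or dropped:
            diff["columns"][table] = {"added": added, "dropped": dropped}
    return diff
```

Renames appear here as a drop plus an add, which matches the observation that consumers must tolerate the JSON format drifting with the schema.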
This is done. There appear to be a few holes in the ES data, but I am sure they can be backfilled. https://activedata.allizom.org/tools/query.html#query_id=IBtfSDPN https://activedata.allizom.org/tools/query.html#query_id=PrsJ1cw+