Closed Bug 1296634 Opened 9 years ago Closed 9 years ago

Export Treeherder job snapshots to S3

Categories: Testing Graveyard :: ActiveData (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: ekyle; Assigned: ekyle
Attachments: 2 files
Pulling all of the Treeherder data via the API proved difficult [1]. This means we must build something to either pull or push the information out of Treeherder. It may be most efficient for Treeherder to push the data out to S3, so that it can control its own resource usage. I imagine a process will be responsible for aggregating the stream of snapshots into batches of ~1000 and writing them to an S3 bucket for other applications to ingest.
This backup can be used by ActiveData, Telemetry, and any other scripts that want a large chunk of data fast. It may also serve as a long term backup for anthropological reasons.
I will attach bugs that list the specific benefits.
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1289830
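The batch-and-upload step described above could be sketched as follows. This is a hypothetical illustration, not the eventual implementation: the function names and the S3 key scheme are made up, and the actual S3 upload (e.g. boto3's `put_object`) is injected as a callable so the batching logic stands on its own:

```python
import gzip
import io
import json

def batch_snapshots(snapshots, upload, batch_size=1000):
    """Group job snapshots into batches of `batch_size` and hand each
    gzipped JSON-lines blob to `upload(key, data)`.  Returns the number
    of batches written."""
    batch = []
    count = 0
    for snap in snapshots:
        batch.append(snap)
        if len(batch) >= batch_size:
            _flush(batch, count, upload)
            count += 1
            batch = []
    if batch:  # flush any final partial batch
        _flush(batch, count, upload)
        count += 1
    return count

def _flush(batch, index, upload):
    # One JSON document per line, gzip-compressed, like the example
    # .json.gz files attached to this bug.
    lines = "\n".join(json.dumps(s) for s in batch)
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(lines.encode("utf-8"))
    upload("jobs/%d.json.gz" % index, buf.getvalue())
```

With boto3, `upload` would be something like `lambda key, data: s3.put_object(Bucket=bucket, Key=key, Body=data)`; injecting it keeps the batching testable without network access.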
Comment 1•9 years ago
++ on this idea, it's been something I've been thinking of lately.
We could also use this for populating a development instance with treeherder data.
Comment 2•9 years ago
I hoe
Comment 3•9 years ago
I added camd and emorley with the hope that they (and wlachance) can share ideas about how best this can be done. I hope to avoid using Django, or any Python, so we minimize the resources this task requires.
1) Ideally, a built-in MySQL feature, like replication, would be best, but I fear those are too general and would pick up security tables and other operations-specific data.
2) My next-best idea is an extract query: it too avoids Python, but it must be maintained across database migrations. I will attach a fun query that produces JSON, unpacks the artifacts, and includes "everything". It is not meant to actually be run.
3) We could dynamically generate the required query by navigating the database relations, and providing a config file listing which tables to NOT include. This would be more robust across database migrations, and could be used on other databases. It would be a stand-alone tool to point at a database to get a streaming backup. It may also work better remotely: I assume the native MySQL transmission format is denser than JSON.
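A minimal sketch of option 3: assuming the parent-to-children foreign-key relations have already been read from the database (in MySQL they can be taken from `information_schema.KEY_COLUMN_USAGE`, via `TABLE_NAME` and `REFERENCED_TABLE_NAME`), a walk over them yields the tables to extract while honoring an exclusion list. All table names here are hypothetical, not Treeherder's actual schema:

```python
def plan_extract(relations, root, exclude=()):
    """Walk parent -> children foreign-key relations starting at `root`
    and return the tables to include in the extract, in visit order.
    Tables in `exclude` are skipped along with everything only
    reachable through them."""
    exclude = set(exclude)
    order, seen = [], set()
    stack = [root]
    while stack:
        table = stack.pop()
        if table in seen or table in exclude:
            continue
        seen.add(table)
        order.append(table)
        stack.extend(relations.get(table, ()))
    return order
```

The returned order could then drive generation of one JOIN/extract query per table, with the config file supplying the `exclude` list.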
Comment 4•9 years ago
Queries like this show SQL is just another language, like Python, but awesomer.
Comment 5•9 years ago
I see this as blocking code coverage to some extent. We now have the redash interface; I wonder whether that helps this bug, invalidates it, or is unrelated.
Comment 6•9 years ago
(In reply to Joel Maher ( :jmaher) from comment #5)
> I see this as blocking code coverage to some extent, we now have the redash
> interface, I do wonder if that helps this bug, invalidates this bug, or
> isn't related.
Not sure what this has to do with code coverage. We may still want to do this, though I think it's a lot less of a priority now that we have the read-only replica (and redash).
Comment 7•9 years ago
I think the only use for this bug now is for people trying to bootstrap a local Treeherder development environment. (And there may be a better solution for this than dumping to S3).
IMO all other use cases are better served (or at least _could_ be served) by pointing people at redash or giving them access to the read-only Treeherder replica.
Comment 8•9 years ago
I am moving this to the ActiveData component. I am still working on it.
Comment 9•9 years ago
This is now running every 10 minutes to extract job records (and related data). The process is currently extracting 30+ jobs/second, and should catch up in about a week.
Ed, your feedback is valuable; you are good at seeing perspectives I have a tendency to miss. I hope you have time to get a high-level understanding of what I have done and tell me your concerns. Thank you.
Code: https://github.com/klahnakoski/MySQL-to-S3
Config: https://github.com/klahnakoski/MySQL-to-S3/blob/master/resources/config/treeherder.json
Example file: https://s3-us-west-2.amazonaws.com/active-data-treeherder-jobs/690.102.json.gz
Example document: (attached)
Attachment #8837139 - Flags: feedback?(emorley)
Comment 10•9 years ago
The planned schema change will test the dynamic schema exploration.
https://bugzilla.mozilla.org/show_bug.cgi?id=1323110
Comment 11•9 years ago
Comment on attachment 8837139 [details]
690.103.106.json
I'm not sure exactly what will be consuming these, so I'm struggling to think of feedback.
My only thought is: how will this handle e.g. tables/fields being renamed/added/dropped, in terms of ingestion and also output? E.g. will a consumer need to expect multiple formats of JSON output depending on age?
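On the consumer side of that question, one defensive approach is to read fields with fallbacks, so the same code can handle records emitted before and after a schema change. This is only an illustration; the field names below are hypothetical, not Treeherder's actual schema:

```python
def read_job(record):
    """Read a job record defensively: fields may appear, disappear,
    or be renamed across schema versions, so fall back gracefully
    rather than raising KeyError on older/newer documents."""
    return {
        # hypothetical rename: newer docs use "job_id", older ones "id"
        "job_id": record.get("job_id", record.get("id")),
        # hypothetical field that may be absent in older documents
        "result": record.get("result", "unknown"),
        # hypothetical nested object that may be missing entirely
        "machine": (record.get("machine") or {}).get("name"),
    }
```

The cost is that every consumer must carry this tolerance; the alternative is versioning the output format explicitly so consumers can dispatch on a version field.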
Attachment #8837139 - Flags: feedback?(emorley)
Comment 12•9 years ago
Ed, thank you for your time!
The use cases for this data are listed in the dependent bugs.
The overall process scans the database for schema changes, and ensures those changes are reflected in the output JSON. You rightly concluded the JSON will change format over time as the database schema changes; this is unavoidable.
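One way such a schema re-scan can work is to diff two snapshots of the `{table: [columns]}` mapping and fold any differences into the output. This is a hedged sketch of the idea, not the actual MySQL-to-S3 implementation:

```python
def schema_diff(old, new):
    """Compare two {table: [column, ...]} schema snapshots and report
    what a periodic re-scan of the database would pick up."""
    diff = {
        "added_tables": sorted(set(new) - set(old)),
        "dropped_tables": sorted(set(old) - set(new)),
        "changed": {},
    }
    for table in set(old) & set(new):
        added = sorted(set(new[table]) - set(old[table]))
        dropped = sorted(set(old[table]) - set(new[table]))
        if added or dropped:
            diff["changed"][table] = {"added": added, "dropped": dropped}
    return diff
```

An extractor running on a schedule would take a fresh snapshot (e.g. from `information_schema.COLUMNS`), diff it against the previous one, and adjust its extract queries accordingly; consumers then see new fields appear in the JSON as the schema evolves.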
Comment 13•9 years ago
This is done. There appear to be a few holes in the ES data, but I am sure they can be backfilled.
https://activedata.allizom.org/tools/query.html#query_id=IBtfSDPN
https://activedata.allizom.org/tools/query.html#query_id=PrsJ1cw+
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•4 years ago
Product: Testing → Testing Graveyard