Pulling all of the Treeherder data via the API proved difficult, so we must build something to either pull or push the information out of Treeherder. It might be most efficient for TH to push the data out to S3, so that it can control its own resource usage. I imagine a process will be responsible for aggregating the stream of snapshots into batches of roughly 1000, and writing them to an S3 bucket for other applications to ingest. This backup can be used by ActiveData, Telemetry, and any other scripts that want a large chunk of data fast. It may also serve as a long-term backup for anthropological reasons. I will attach bugs that list the specific benefits.  https://bugzilla.mozilla.org/show_bug.cgi?id=1289830
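A minimal sketch of the batching step described above, assuming snapshots arrive as a stream of dicts. The batch size and record shape are placeholders, and the actual S3 upload (e.g. via boto3) is deliberately omitted; this only shows grouping and gzipped JSON-lines encoding.

```python
import gzip
import io
import json

BATCH_SIZE = 1000  # rough batch size suggested above; tune as needed


def make_batches(snapshots, batch_size=BATCH_SIZE):
    """Group a stream of snapshot dicts into lists of at most batch_size."""
    batch = []
    for snap in snapshots:
        batch.append(snap)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def encode_batch(batch):
    """Serialize one batch as gzipped JSON-lines, ready for an S3 PUT.

    The upload itself (bucket name, keys, credentials) is out of scope here.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in batch:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
    return buf.getvalue()
```

JSON-lines-per-batch keeps each S3 object independently ingestible, which suits consumers that want "a large chunk of data fast".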
++ on this idea, it's been something I've been thinking of lately. We could also use this for populating a development instance with treeherder data.
I added camd and emorley with the hope that they (and wlachance) can share ideas about how best this can be done. I hope to avoid using Django, or any Python, so we minimize the resources this task requires.

1) Ideally, a built-in MySQL feature would be best, like replication, but I fear those are too general, and would pick up security tables and other operations-specific data.

2) My next-best idea is an extract query: it too benefits from avoiding Python, but it must be maintained over database migrations. I will attach a fun query that produces JSON, unpacks the artifacts, and includes "everything". Not that it is meant to be run.

3) We could dynamically generate the required query by navigating the database relations, with a config file saying which tables to NOT include. This would be more robust over database migrations, and could be used on other databases. It would be a stand-alone tool to point at a database to get a streaming backup. It may also work better remotely: I assume the native MySQL transmission format would be denser than JSON.
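A hypothetical sketch of option 3's relation walk: decide which tables an extract must cover by following foreign-key edges outward from a root table, skipping tables the config excludes. The relations dict here stands in for what would actually be read from information_schema.KEY_COLUMN_USAGE; all table names are invented for illustration.

```python
# child table -> parent tables it references (placeholder for FK metadata
# that a real tool would read from information_schema.KEY_COLUMN_USAGE)
RELATIONS = {
    "job": ["push", "machine"],
    "job_log": ["job"],
    "failure_line": ["job_log"],
    "auth_user": [],
}

# security/operations tables the config file rules out (invented names)
EXCLUDE = {"auth_user"}


def reachable_tables(root, relations=RELATIONS, exclude=EXCLUDE):
    """Return every table reachable from `root` along FK edges, either direction."""
    # Build an undirected adjacency map so child tables of the root are found too.
    edges = {}
    for child, parents in relations.items():
        for parent in parents:
            edges.setdefault(child, set()).add(parent)
            edges.setdefault(parent, set()).add(child)
    seen, stack = set(), [root]
    while stack:
        table = stack.pop()
        if table in seen or table in exclude:
            continue
        seen.add(table)
        stack.extend(edges.get(table, ()))
    return seen
```

Because the walk is driven by live FK metadata rather than a hand-written query, it survives schema migrations, which is the robustness argument made above.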
Created attachment 8783010 [details]
all_treeherder.sql

Queries like this show SQL is just another language, like Python, but awesomer.
I see this as blocking code coverage to some extent. Now that we have the redash interface, I do wonder if that helps this bug, invalidates this bug, or isn't related.
(In reply to Joel Maher ( :jmaher) from comment #5)
> I see this as blocking code coverage to some extent, we now have the redash
> interface, I do wonder if that helps this bug, invalidates this bug, or
> isn't related.

Not sure what this has to do with code coverage. We may still want to do this, though I think it's a lot less of a priority now that we have the read-only replica (and redash).
I think the only use for this bug now is for people trying to bootstrap a local Treeherder development environment. (And there may be a better solution for this than dumping to S3). IMO all other use cases are better served (or at least _could_ be served) by pointing people at redash or giving them access to the read-only Treeherder replica.
I am moving this to ActiveData. I am still working on it.
Created attachment 8837139 [details]
690.103.106.json

This is now running every 10 minutes to extract job records (and related data). The process is currently extracting 30+ jobs/second, and should catch up in about a week.

Ed, your feedback is valuable; you are good at seeing perspectives I have a tendency to miss. I hope you have time to get a high-level understanding of what I have done and tell me your concerns. Thank you.

Code: https://github.com/klahnakoski/MySQL-to-S3
Config: https://github.com/klahnakoski/MySQL-to-S3/blob/master/resources/config/treeherder.json
Example file: https://s3-us-west-2.amazonaws.com/active-data-treeherder-jobs/690.102.json.gz
Example document: (attached)
The planned schema change will test the dynamic schema exploration. https://bugzilla.mozilla.org/show_bug.cgi?id=1323110
Comment on attachment 8837139 [details]
690.103.106.json

I'm not sure exactly what will be consuming these files, so I'm struggling to think of feedback. My only thought is: how will this handle, e.g., tables/fields being renamed/added/dropped, both in terms of ingestion and of output? E.g. will a consumer need to expect multiple formats of JSON output depending on age?
Ed, thank you for your time! The use cases for this data are listed in the dependent bugs. The overall process scans the database for schema changes, and ensures those changes are reflected in the output JSON. You rightly concluded that the JSON will change format over time as the database schema changes; this is unavoidable.
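A minimal sketch of the schema scan described above, assuming each scan produces a snapshot of the form {table: [column, ...]} (as could be read from information_schema.COLUMNS). Comparing two snapshots tells the extractor what to add to or drop from the output JSON; the table and column names below are invented.

```python
def schema_diff(old, new):
    """Compare two {table: [columns]} snapshots; report what changed."""
    diff = {
        "added_tables": sorted(set(new) - set(old)),
        "dropped_tables": sorted(set(old) - set(new)),
        "columns": {},
    }
    # For tables present in both snapshots, report column-level changes.
    for table in set(old) & set(new):
        added = sorted(set(new[table]) - set(old[table]))
        dropped = sorted(set(old[table]) - set(new[table]))
        if added or dropped:
            diff["columns"][table] = {"added": added, "dropped": dropped}
    return diff
```

Renames appear here as a drop plus an add, which matches the observation that consumers must tolerate the JSON format drifting with the schema.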
This is done. There appear to be a few holes in the ES data, but I am sure they can be backfilled. https://activedata.allizom.org/tools/query.html#query_id=IBtfSDPN https://activedata.allizom.org/tools/query.html#query_id=PrsJ1cw+