Closed Bug 1314120 Opened 9 years ago Closed 9 years ago

Productionize Socorro import

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: amiyaguchi)

Attachments

(1 file)

The prototype import at [1] needs some work before it can be considered production. Namely:
- Replace the hard-coded schema: Read the definitive JSON Schema for crashes from [2] and convert it to a Spark SQL struct
- Replace the hard-coded version: Pull the "version" string from the above schema and use it as a path component when saving to S3
- Store the output data in the official S3 bucket (telemetry-parquet)
- Schedule the generation code to run via Airflow

[1] https://gist.github.com/mreid-moz/092029949782249577aee92602879e2b
[2] https://github.com/mozilla/socorro/blob/master/socorro/schemas/crash_report.json
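For illustration, the schema conversion in the first item could look roughly like the sketch below. This is not the code from the gist. It walks a JSON Schema and emits the dictionary form of a Spark schema, which PySpark can load with `StructType.fromJson`. The function name `to_spark_type`, the type mapping, and the choice to mark every field nullable are assumptions made for this example.

```python
# Hypothetical sketch: convert a JSON Schema node into the dict form of a
# Spark SQL type. The result for an object can be passed to
# pyspark.sql.types.StructType.fromJson when PySpark is available.

# Assumed mapping from JSON Schema primitives to Spark SQL type names.
PRIMITIVES = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def to_spark_type(node):
    """Return the Spark JSON representation for one JSON Schema node."""
    types = node.get("type", "string")
    if isinstance(types, str):
        types = [types]
    # JSON Schema allows e.g. ["string", "null"]; use the first non-null type.
    non_null = [t for t in types if t != "null"]
    kind = non_null[0] if non_null else "string"

    if kind == "object":
        fields = [
            {
                "name": name,
                "type": to_spark_type(child),
                "nullable": True,  # assumption: every field may be missing
                "metadata": {},
            }
            for name, child in node.get("properties", {}).items()
        ]
        return {"type": "struct", "fields": fields}

    if kind == "array":
        return {
            "type": "array",
            "elementType": to_spark_type(node.get("items", {})),
            "containsNull": True,
        }

    # Unknown types fall back to string in this sketch.
    return PRIMITIVES.get(kind, "string")
```

With PySpark installed, `StructType.fromJson(to_spark_type(schema))` would turn the result into a struct that can be handed to a DataFrame reader. Falling back to `string` for unknown types is a design choice for the sketch; a production job may prefer to fail loudly instead.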
Assignee: nobody → amiyaguchi
Priority: -- → P2
Blocks: 1273657
See Also: → 1312006
Points: --- → 3
Depends on: 1314252
I've created a fork of the gist [1] that adds support for generating the Spark structs and pulling out the version number for versioning the S3 output path. The results of a trial run for 11/01/2016 can be found under `s3://net-mozaws-prod-us-west-2-pipeline-analysis/mreid/crash/v4/v0/`.

[1] https://gist.github.com/acmiyaguchi/bd08b62b025b80acc16efb63be29ea35
For what it's worth, the crash_report.json JSON Schema now has a version in it:
https://github.com/mozilla/socorro/commit/d429b403d7f5e44a7909656bc42cfaabae43520a

It's still only available on github.com, but by early next week it'll be in S3.
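As a rough sketch of how that version could become a path component: the helper below reads the version from the parsed schema and appends it to an output prefix. The helper name `versioned_prefix` and the key name `$target_version` are assumptions for this example; check the schema file for the actual key before relying on it.

```python
def versioned_prefix(base, schema, version_key="$target_version"):
    """Append the schema version to an S3 prefix as a 'v<version>' component.

    `version_key` is an assumed key name, not confirmed by this bug.
    """
    if version_key not in schema:
        raise KeyError("schema has no {!r} entry".format(version_key))
    return "{}/v{}".format(base.rstrip("/"), schema[version_key])
```

Called as `versioned_prefix("s3://telemetry-parquet/socorro_crash", schema)`, this yields a prefix ending in `/v<version>`, so a schema bump lands in a new directory instead of mixing with older output.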
Peter, what is the path you plan to use on S3 for the crash schema?
Flags: needinfo?(peterbe)
(In reply to Mark Reid [:mreid] from comment #3)
> Peter, what is the path you plan to use on S3 for the crash schema?

/crash_report.json
Flags: needinfo?(peterbe)
Ok, and per bug 1311522, the bucket is org-mozilla-telemetry-crashes.
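Putting the bucket and path together, loading the schema from S3 might look like the sketch below. It assumes the object key is `crash_report.json` (the path above without the leading slash) and that `client` behaves like a boto3 S3 client, whose `get_object` returns a dict with a readable `Body`. The function name `load_schema` is made up for this example.

```python
import json

SCHEMA_BUCKET = "org-mozilla-telemetry-crashes"
SCHEMA_KEY = "crash_report.json"  # assumed key, derived from "/crash_report.json"


def load_schema(client, bucket=SCHEMA_BUCKET, key=SCHEMA_KEY):
    """Fetch and parse the crash JSON Schema using an S3-style client."""
    response = client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read().decode("utf-8"))
```

With boto3 installed and credentials configured, `load_schema(boto3.client("s3"))` would return the parsed schema as a dict. Passing the client in keeps the function easy to exercise with a stub.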
Blocks: 1303555
The crash data should now be accessible for use. The Socorro import job is running on Airflow, with the resulting data being placed into `s3://telemetry-parquet/socorro_crash/v1`.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard