Closed Bug 1325667 Opened 7 years ago Closed 7 years ago

Persist parquet data from hindsight to S3

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: robotblake)

Details

(Whiteboard: [SvcOps])

Given an output plugin that converts data directly to parquet, we would like to upload that data to S3 so it can be made available in re:dash.

That should involve:
1. Decide upon an S3 location. I propose s3://telemetry-parquet/data-lake/<docType>/v<targetVersion>/<partitions>/somefile.parquet
2. Set up a data lake loader (similar to our data warehouse loader that uploads heka-framed data) that writes files locally
3. Set up the file-uploader as used on the existing edge nodes that can upload and prune completed parquet files.
4. Run parquet2hive on the "data-lake" dir so that new partitions are made available to re:dash.
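
For illustration, the key layout proposed in step 1 could be assembled roughly as below. This is only a sketch: the helper name `data_lake_key`, the Hive-style `name=value` partition directories, and the `submission_date_s3` partition name in the usage line are assumptions, not something decided in this bug.

```python
from typing import Mapping


def data_lake_key(doc_type: str, target_version: int,
                  partitions: Mapping[str, str], filename: str) -> str:
    """Build an object key shaped like
    data-lake/<docType>/v<targetVersion>/<partitions>/<filename>.

    Assumption: each partition becomes a Hive-style "name=value" directory,
    in the order the mapping yields them.
    """
    parts = ["data-lake", doc_type, f"v{target_version}"]
    # One path component per partition column.
    parts.extend(f"{name}={value}" for name, value in partitions.items())
    parts.append(filename)
    return "/".join(parts)


# Hypothetical usage; the bucket name comes from the proposal in step 1.
key = data_lake_key("core", 1, {"submission_date_s3": "20170101"}, "somefile.parquet")
uri = f"s3://telemetry-parquet/{key}"
```

Keeping the partition values in the path is what would let step 4 discover new partitions by listing the prefix.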
Please work with Trink for more info as needed, and update the scope.
Assignee: nobody → whd
Priority: -- → P2
I didn't see this bug, but all the steps I'm responsible for (1-3) are completed. The data currently goes to s3://net-mozaws-prod-us-west-2-pipeline-data/*-parquet prefixes, but this can be changed easily (s3://telemetry-parquet is hosted in dev, so I am avoiding writing to it from prod). We're currently performing direct-to-parquet for core (bug #1333203) and testpilot (bug #1333206) pings, but others can be added as needed. The conversion runs on the regular DWL (data warehouse loader).

I believe (4) is blocked by bug #1333066, so I'm marking that as a blocker.
Depends on: 1333066
Whiteboard: [SvcOps]
Per bug #1344349 we've adopted a standard versioning policy on direct-to-parquet data, and :robotblake's working on the p2h import stuff, which will automatically import new direct-to-parquet datasets into presto. When that's done I think this bug can be closed.
Assignee: whd → bimsland
Status: NEW → ASSIGNED
Points: --- → 1
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard