Given an output plugin to convert data directly to parquet, we would like to upload that data to S3 so it can be made available in re:dash. That should involve:

1. Decide upon an S3 location. I propose s3://telemetry-parquet/data-lake/<docType>/v<targetVersion>/<partitions>/somefile.parquet
2. Set up a data lake loader (similar to our data warehouse loader, which uploads heka-framed data) that writes files locally.
3. Set up the file uploader used on the existing edge nodes, which can upload and prune completed parquet files.
4. Run parquet2hive on the "data-lake" dir so that new partitions are made available to re:dash.
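For step 1, here's a minimal sketch of how keys under the proposed layout could be constructed, with partitions rendered Hive-style as name=value path segments. The function name and the sample docType/partition values are illustrative, not part of this bug:

```python
def build_s3_key(doc_type, target_version, partitions, filename):
    """Build an object key under the proposed data-lake prefix:
    data-lake/<docType>/v<targetVersion>/<partitions>/<filename>

    partitions is an ordered mapping of partition name -> value,
    rendered as Hive-style name=value path segments so parquet2hive
    can pick them up as table partitions.
    """
    parts = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"data-lake/{doc_type}/v{target_version}/{parts}/{filename}"

key = build_s3_key("core", 1, {"submission_date": "20170301"}, "part-00000.parquet")
print(key)
# data-lake/core/v1/submission_date=20170301/part-00000.parquet
```

The full S3 URI would then be s3://telemetry-parquet/ plus this key.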
Please work with Trink for more info as needed, and update the scope accordingly.
Assignee: nobody → whd
Priority: -- → P2
I didn't see this bug, but all the steps I'm responsible for (1-3) are complete. The data currently goes to s3://net-mozaws-prod-us-west-2-pipeline-data/*-parquet prefixes, but this can be changed easily (s3://telemetry-parquet is hosted in dev, so I am avoiding writing to it from prod).

We're currently performing direct-to-parquet conversion for core (bug #1333203) and testpilot (bug #1333206) pings, but others can be added as needed. This is performed on the regular DWL.

I believe (4) is blocked by bug #1333066, so I'm marking that as a blocker.
Depends on: 1333066
Per bug #1344349 we've adopted a standard versioning policy for direct-to-parquet data, and :robotblake is working on the parquet2hive import tooling that will make new direct-to-parquet datasets automatically available in Presto. When that's done, I think this bug can be closed.
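To illustrate the versioning convention: dataset versions live in vN directories under each docType prefix, and tooling imports the desired (typically newest) version. A small sketch of picking the latest version from a list of prefixes; the helper name and inputs are hypothetical:

```python
import re

def latest_version(prefixes):
    """Given dataset version directories like ["v1", "v2"], return the
    newest one, or None if no vN-style entries are present."""
    versions = []
    for p in prefixes:
        m = re.fullmatch(r"v(\d+)", p)
        if m:
            versions.append(int(m.group(1)))
    return f"v{max(versions)}" if versions else None

print(latest_version(["v1", "v3", "v2"]))
# v3
```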
Assignee: whd → bimsland
Status: NEW → ASSIGNED
Points: --- → 1
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED