Closed Bug 1325667 Opened 7 years ago Closed 7 years ago

Persist parquet data from hindsight to S3

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: mreid, Assigned: robotblake)

References

Details

(Whiteboard: [SvcOps])

Mark Reid [:mreid]

Reporter

Description

•

7 years ago

Given an output plugin that converts data directly to parquet, we would like to upload that data to S3 so it can be made available in re:dash.

That should involve:
1. Decide upon an S3 location. I propose s3://telemetry-parquet/data-lake/<docType>/v<targetVersion>/<partitions>/somefile.parquet
2. Set up a data lake loader (similar to our data warehouse loader that uploads heka-framed data) that writes files locally
3. Set up the file-uploader as used on the existing edge nodes that can upload and prune completed parquet files.
4. Run parquet2hive on the "data-lake" dir so that new partitions are made available to re:dash
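
To make steps 1 and 3 concrete, here is a minimal Python sketch, not the actual loader or file-uploader. It builds keys in the proposed layout and uploads and prunes finished files. Several things are assumptions for illustration only: the function names `build_s3_key` and `upload_and_prune`, the Hive-style `name=value` partition directories, a local directory tree that mirrors the S3 layout, and a loader that renames a file to `*.parquet` only once it is complete. The `upload` callable is a stand-in for whatever client actually performs the S3 upload.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Prefix taken from the proposed location; the bucket itself is left to the caller.
PREFIX = "data-lake"


def build_s3_key(doc_type: str, target_version: int,
                 partitions: Iterable[Tuple[str, str]], filename: str) -> str:
    """Build <prefix>/<docType>/v<targetVersion>/<partitions>/<filename>.

    Assumption: partitions are rendered as Hive-style name=value directories.
    """
    parts = [PREFIX, doc_type, f"v{target_version}"]
    parts.extend(f"{name}={value}" for name, value in partitions)
    parts.append(filename)
    return "/".join(parts)


def upload_and_prune(local_dir: Path,
                     upload: Callable[[Path, str], None]) -> List[str]:
    """Upload every completed parquet file under local_dir, then delete it.

    Assumption: only finished files carry the .parquet suffix, and the local
    tree mirrors the S3 layout below PREFIX. `upload` must raise on failure
    so that a file is never pruned before it has been uploaded.
    """
    uploaded = []
    for path in sorted(local_dir.rglob("*.parquet")):
        key = f"{PREFIX}/{path.relative_to(local_dir).as_posix()}"
        upload(path, key)   # raises on failure, so the unlink below is skipped
        path.unlink()       # prune the local copy only after a successful upload
        uploaded.append(key)
    return uploaded
```

For example, `build_s3_key("core", 1, [("submission_date", "20170101")], "somefile.parquet")` yields `data-lake/core/v1/submission_date=20170101/somefile.parquet`. Passing the upload step in as a callable keeps the sketch independent of any particular S3 client.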

Thomas Huelbert

Comment 1

•

7 years ago

Please work with Trink for more info as needed and update the scope.

Assignee: nobody → whd

Priority: -- → P2

Wesley Dawson [:whd]

Comment 2

•

7 years ago

I didn't see this bug, but all the steps I'm responsible for (1-3) are completed. The data currently goes to s3://net-mozaws-prod-us-west-2-pipeline-data/*-parquet prefixes but this can be changed easily (s3://telemetry-parquet is hosted in dev and thus I am avoiding writing to it from prod). We're currently performing direct-to-parquet for core (bug #1333203) and testpilot (bug #1333206) pings, but others can be added as needed. This is performed on the regular DWL.

I believe (4) is blocked by bug #1333066, so I'm marking that as a blocker.

Depends on: 1333066

Jason Thomas [:jason]

Updated

•

7 years ago

Whiteboard: [SvcOps]

Wesley Dawson [:whd]

Comment 3

•

7 years ago

Per bug #1344349 we've adopted a standard versioning policy on direct-to-parquet data, and :robotblake's working on the p2h import stuff, which will automatically import new direct-to-parquet datasets into presto. When that's done I think this bug can be closed.

Assignee: whd → bimsland

Status: NEW → ASSIGNED

Points: --- → 1

Blake Imsland [:robotblake]

Assignee

Updated

•

7 years ago

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard


Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)
