Closed Bug 1572112 Opened 6 years ago Closed 6 years ago

Add support for GCP payload_bytes_decoded to python_mozaggregator

Categories

(Data Platform and Tools :: General, task, P1)

Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: akomar, Assigned: amiyaguchi)

References

Details

Attachments

(3 files)

This looks to be the easiest way to run mozaggregator on GCP.

Blocks: 1572115

Arkadiusz, are you taking this on?

Flags: needinfo?(akomarzewski)

I've changed the title to reflect what I believe is actually going to happen here. Landfill (payload_bytes_raw) should generally not be used for analysis, except perhaps for the schemas integration tests.

Summary: Add support for GCP landfill to python_mozaggregator → Add support for GCP payload_bytes_decoded to python_mozaggregator

(In reply to Frank Bertsch [:frank] from comment #1)

> Arkadiusz, are you taking this on?

I don't have the bandwidth right now; maybe next week. If anyone picks this up, here's a notebook where I tested reading from payload_bytes_decoded in Spark (we'll need to do something similar here):
https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/152757/command/152763
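For anyone picking this up, the core of "reading from payload_bytes_decoded" is turning the payload bytes column of each row into a Python dict. A minimal sketch, assuming the payload column holds UTF-8 JSON bytes that may be gzip-compressed (the helper name and the sample row below are illustrative, not the project's actual API):

```python
import gzip
import json


def parse_payload(payload: bytes) -> dict:
    """Hypothetical helper: parse a payload bytes column into a dict.

    Assumes the payload is UTF-8 JSON, optionally gzip-compressed
    (detected via the gzip magic bytes 0x1f 0x8b).
    """
    if payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
    return json.loads(payload)


# Example with a plain (uncompressed) JSON payload, as one row might look:
row = {"payload": b'{"environment": {"system": {"os": {"name": "Linux"}}}}'}
ping = parse_payload(row["payload"])
print(ping["environment"]["system"]["os"]["name"])  # Linux
```

In a Spark job this would typically run inside a map over the rows of the BigQuery table rather than on a single dict.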

(In reply to Wesley Dawson [:whd] from comment #2)

> I've changed the title to reflect what I believe is actually going to happen here. Landfill (payload_bytes_raw) should generally not be used for analysis, except perhaps for the schemas integration tests.

Thanks! This bug is older than these tables.

Flags: needinfo?(akomarzewski)

Thanks Arkadiusz, I'm just checking to see if this was on your radar for this quarter. Sounds like it might be; I'm tentatively setting it to P3 (finish this quarter) with you as assignee. Feel free to redirect if needed.

Assignee: nobody → akomarzewski
Priority: -- → P3
See Also: → 1517018

I'm going to take this on; I have some experience reading the payload_bytes_decoded dataset into Spark.

Assignee: akomarzewski → amiyaguchi
Points: --- → 3
Priority: P3 → P1
Attached file GitHub Pull Request

This intermediate PR adds a CLI interface for running directly on a Spark cluster via bdist_egg and --py-files. It runs on Databricks, which should enable the Dataset shim for BigQuery and, eventually, running this job on Dataproc.
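The rough shape of such a CLI entry point can be sketched with stdlib argparse. This is illustrative only: the real project defines its own commands and options, and the flag names below simply mirror the dataproc.sh examples later in this bug.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of a spark-submit-friendly CLI for the
    aggregation job; subcommand and option names mirror the
    dataproc.sh invocations shown in this bug."""
    parser = argparse.ArgumentParser(prog="mozaggregator")
    sub = parser.add_subparsers(dest="command", required=True)

    parquet = sub.add_parser("parquet", help="write aggregates as parquet")
    parquet.add_argument("--output", required=True, help="gs:// output path")
    parquet.add_argument("--num-partitions", type=int, default=200)
    parquet.add_argument("--date", required=True, help="ds_nodash, e.g. 20191111")
    parquet.add_argument("--channels", default="nightly")
    parquet.add_argument("--source", default="bigquery")
    parquet.add_argument("--project-id")
    return parser


# Parse an argument list like the one spark-submit would forward:
args = build_parser().parse_args([
    "parquet",
    "--output", "gs://example-bucket/parquet_test/20191111",
    "--date", "20191111",
    "--source", "bigquery",
])
print(args.command, args.num_partitions)  # parquet 200
```

Shipping the package as an egg (`python setup.py bdist_egg`) and passing it via `--py-files` lets the driver script import it on every executor without a cluster-wide install.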

This swaps out the EMRSparkOperator for the MozDatabricksRunSubmit operator. It should also be easier to add the proper credentials for the BigQuery shims to Databricks clusters than to the EMR clusters.

Depends on: 1580957

I've tested that this works with the dataproc script in the bin folder of python_mozaggregator:

NUM_WORKERS=3 bin/dataproc.sh \
	parquet \
	--output gs://amiyaguchi-dev/mozaggregator/parquet_test/bq_prod/20191111 \
	--num-partitions 200 \
	--date 20191111 \
	--channels nightly \
	--source bigquery \
	--project-id moz-fx-data-shared-prod

This can also write to the dev Postgres database hosted in AWS:

NUM_WORKERS=3 bin/dataproc.sh \
	aggregator \
	--credentials-protocol s3 \
	--credentials-bucket telemetry-spark-emr-2 \
	--credentials-prefix aggregator_dev_database_envvars.json \
	--num-partitions 200 \
	--date 20191101 \
	--source bigquery \
	--project-id moz-fx-data-shared-prod
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: Telemetry Aggregates → General