Closed Bug 1572112 Opened 6 years ago Closed 6 years ago

Add support for GCP payload_bytes_decoded to python_mozaggregator

Categories

(Data Platform and Tools :: General, task, P1)

Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: akomar, Assigned: amiyaguchi)

References

Details

Attachments

(3 files)

This looks to be the easiest way to run mozaggregator on GCP.

Blocks: 1572115

Arkadiusz, are you taking this on?

Flags: needinfo?(akomarzewski)

I've changed the title to reflect what I believe is actually going to happen here. Landfill (payload_bytes_raw) should generally not be used for analysis, except perhaps for the schemas integration tests.

Summary: Add support for GCP landfill to python_mozaggregator → Add support for GCP payload_bytes_decoded to python_mozaggregator

(In reply to Frank Bertsch [:frank] from comment #1)

> Arkadiusz, are you taking this on?

I don't have the bandwidth right now; maybe next week. If anyone picks this up, here's a notebook where I tested reading from payload_bytes_decoded in Spark (we'll need to do something similar here):
https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/152757/command/152763
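For anyone picking this up, the core of "reading from payload_bytes_decoded" is turning the payload bytes column of each row into a Python dict. A minimal sketch, assuming the payload column holds UTF-8 JSON bytes that may be gzip-compressed (the helper name and the sample row below are illustrative, not the project's actual API):

```python
import gzip
import json


def parse_payload(payload: bytes) -> dict:
    """Hypothetical helper: parse a payload bytes column into a dict.

    Assumes the payload is UTF-8 JSON, optionally gzip-compressed
    (detected via the gzip magic bytes 0x1f 0x8b).
    """
    if payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
    return json.loads(payload)


# Example with a plain (uncompressed) JSON payload, as one row might look:
row = {"payload": b'{"environment": {"system": {"os": {"name": "Linux"}}}}'}
ping = parse_payload(row["payload"])
print(ping["environment"]["system"]["os"]["name"])  # Linux
```

In a Spark job this would typically run inside a map over the rows of the BigQuery table rather than on a single dict.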

(In reply to Wesley Dawson [:whd] from comment #2)

> I've changed the title to reflect what I believe is actually going to happen here. Landfill (payload_bytes_raw) should generally not be used for analysis, except perhaps for the schemas integration tests.

Thanks! This bug is older than these tables.

Flags: needinfo?(akomarzewski)

Thanks Arkadiusz, I'm just checking to see if this was on your radar for this quarter. Sounds like it might be; I'm tentatively setting it to P3 (finish this quarter) with you as assignee. Feel free to redirect if needed.

Assignee: nobody → akomarzewski
Priority: -- → P3
See Also: → 1517018

I'm going to take this on; I have some experience reading the payload_bytes_decoded dataset into Spark.

Assignee: akomarzewski → amiyaguchi
Points: --- → 3
Priority: P3 → P1
Attached file GitHub Pull Request

This intermediate PR adds a CLI interface for running directly on a Spark cluster via bdist_egg and --py-files. It runs on Databricks, which should enable the Dataset shim for BigQuery and, eventually, running this job on Dataproc.
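The rough shape of such a CLI entry point can be sketched with stdlib argparse. This is illustrative only: the real project defines its own commands and options, and the flag names below simply mirror the dataproc.sh examples later in this bug.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of a spark-submit-friendly CLI for the
    aggregation job; subcommand and option names mirror the
    dataproc.sh invocations shown in this bug."""
    parser = argparse.ArgumentParser(prog="mozaggregator")
    sub = parser.add_subparsers(dest="command", required=True)

    parquet = sub.add_parser("parquet", help="write aggregates as parquet")
    parquet.add_argument("--output", required=True, help="gs:// output path")
    parquet.add_argument("--num-partitions", type=int, default=200)
    parquet.add_argument("--date", required=True, help="ds_nodash, e.g. 20191111")
    parquet.add_argument("--channels", default="nightly")
    parquet.add_argument("--source", default="bigquery")
    parquet.add_argument("--project-id")
    return parser


# Parse an argument list like the one spark-submit would forward:
args = build_parser().parse_args([
    "parquet",
    "--output", "gs://example-bucket/parquet_test/20191111",
    "--date", "20191111",
    "--source", "bigquery",
])
print(args.command, args.num_partitions)  # parquet 200
```

Shipping the package as an egg (`python setup.py bdist_egg`) and passing it via `--py-files` lets the driver script import it on every executor without a cluster-wide install.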

This swaps out the EMRSparkOperator for the MozDatabricksRunSubmit operator. It should also be easier to add the proper credentials for the BigQuery shims to Databricks clusters than to the EMR clusters.

Depends on: 1580957

I've tested that this works with the dataproc script in the bin folder of python_mozaggregator:

NUM_WORKERS=3 bin/dataproc.sh \
	parquet \
	--output gs://amiyaguchi-dev/mozaggregator/parquet_test/bq_prod/20191111 \
	--num-partitions 200 \
	--date 20191111 \
	--channels nightly \
	--source bigquery \
	--project-id moz-fx-data-shared-prod

This can also write to the dev Postgres database hosted in AWS:

NUM_WORKERS=3 bin/dataproc.sh \
	aggregator \
	--credentials-protocol s3 \
	--credentials-bucket telemetry-spark-emr-2 \
	--credentials-prefix aggregator_dev_database_envvars.json \
	--num-partitions 200 \
	--date 20191101 \
	--source bigquery \
	--project-id moz-fx-data-shared-prod
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: Telemetry Aggregates → General