Add support for GCP payload_bytes_decoded to python_mozaggregator
Categories
(Data Platform and Tools :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: akomar, Assigned: amiyaguchi)
References
Details
Attachments
(3 files)
This looks to be the easiest way to run mozaggregator on GCP.
Comment 2•6 years ago
I've changed the title to reflect what I believe is actually going to happen here. Landfill (payload_bytes_raw) should generally be unused for analysis except for perhaps the schemas integration tests.
Reporter
Comment 3•6 years ago
(In reply to Frank Bertsch [:frank] from comment #1)
Arkadiusz, are you taking this on?
I don't have bandwidth right now, maybe next week. If anyone picks this up, here's a notebook where I tested reading from payload_bytes_decoded in Spark (we'll need to do something similar here):
https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/152757/command/152763
(In reply to Wesley Dawson [:whd] from comment #2)
I've changed the title to reflect what I believe is actually going to happen here. Landfill (payload_bytes_raw) should generally be unused for analysis except for perhaps the schemas integration tests.
Thanks! This bug is older than these tables.
Comment 4•6 years ago
Thanks Arkadiusz, I'm just checking to see if this was on your radar for this quarter. Sounds like it might be, I'm tentatively putting it as P3 (finished this Q) with you as assignee; feel free to redirect if needed.
Assignee
Comment 5•6 years ago
I'm going to take this on; I have some experience reading the payload_bytes_decoded dataset into Spark.
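For context, reading a decoded ping table from Spark mostly comes down to pointing the BigQuery connector at the right fully qualified table name. Below is a minimal sketch of that; the dataset/table naming scheme shown here is an assumption for illustration, not taken from this bug.

```python
# Hypothetical helper: build the fully qualified BigQuery table name for a
# decoded ping table. The naming scheme assumed here (dataset
# `payload_bytes_decoded`, table `<namespace>_<namespace>__<doctype>_v<version>`)
# is illustrative, not confirmed by this bug.
def decoded_table(project, namespace, doctype, version):
    return (f"{project}.payload_bytes_decoded."
            f"{namespace}_{namespace}__{doctype}_v{version}")

table = decoded_table("moz-fx-data-shared-prod", "telemetry", "main", 4)

# With the spark-bigquery connector on the cluster, the read might then look
# roughly like (not executed here, since it needs a live Spark session):
#   df = (spark.read.format("bigquery")
#         .option("table", table)
#         .load())
```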
Updated•6 years ago
Comment 6•6 years ago
Assignee
Comment 7•6 years ago
This intermediate PR adds a CLI interface for running directly on a Spark cluster via bdist_egg and --py-files. It runs on Databricks, which should enable the Dataset shim for BigQuery and, eventually, running this job on Dataproc.
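A CLI like the one described might be sketched with argparse subcommands whose option names mirror the invocations shown later in this bug. This is an illustrative sketch, not the actual python_mozaggregator interface; the subcommand and flag set are inferred from the dataproc.sh examples below.

```python
import argparse

def make_parser():
    # Illustrative mozaggregator-style CLI; flags mirror the dataproc.sh
    # invocations in this bug, but this is not the real interface.
    parser = argparse.ArgumentParser(prog="mozaggregator")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("parquet", "aggregator"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--date", required=True)
        cmd.add_argument("--num-partitions", type=int, default=200)
        cmd.add_argument("--source", choices=["bigquery", "moztelemetry"],
                         default="moztelemetry")
        cmd.add_argument("--project-id")
        cmd.add_argument("--channels")
        cmd.add_argument("--output")
    return parser

args = make_parser().parse_args([
    "parquet", "--date", "20191111", "--channels", "nightly",
    "--source", "bigquery", "--project-id", "moz-fx-data-shared-prod",
])
```

Packaging the module as an egg (`python setup.py bdist_egg`) and shipping it with `spark-submit --py-files` then lets the same entry point run on any cluster that has the dependencies installed.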
Assignee
Comment 8•6 years ago
This swaps out the EMRSparkOperator for the MozDatabricksRunSubmit operator. It should also be easier to add the proper credentials for the BigQuery shims to Databricks clusters than to the EMR clusters.
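The operator mentioned above is internal to Mozilla's Airflow repo, so its exact signature isn't shown here. Under the hood, a Databricks-based operator submits a payload to the Databricks Runs Submit API (api/2.0/jobs/runs/submit); the sketch below shows roughly what such a payload could look like for this job. The run name, cluster sizing, library, file path, and parameters are all illustrative assumptions.

```python
# Hypothetical Runs Submit payload for the aggregator job. Everything below
# (names, versions, paths, cluster size) is an assumption for illustration.
run_submit_json = {
    "run_name": "prerelease_telemetry_aggregates",
    "new_cluster": {
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "i3.xlarge",
        "num_workers": 3,
    },
    "libraries": [{"pypi": {"package": "python_mozaggregator"}}],
    "spark_python_task": {
        "python_file": "dbfs:/jobs/mozaggregator/run.py",
        "parameters": [
            "aggregator", "--date", "20191101",
            "--source", "bigquery",
            "--project-id", "moz-fx-data-shared-prod",
        ],
    },
}
```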
Assignee
Comment 9•6 years ago
I've tested that this works with the dataproc script in the bin folder of python_mozaggregator:
NUM_WORKERS=3 bin/dataproc.sh \
parquet \
--output gs://amiyaguchi-dev/mozaggregator/parquet_test/bq_prod/20191111 \
--num-partitions 200 \
--date 20191111 \
--channels nightly \
--source bigquery \
--project-id moz-fx-data-shared-prod
This can also write to the dev Postgres database hosted in AWS:
NUM_WORKERS=3 bin/dataproc.sh \
aggregator \
--credentials-protocol s3 \
--credentials-bucket telemetry-spark-emr-2 \
--credentials-prefix aggregator_dev_database_envvars.json \
--num-partitions 200 \
--date 20191101 \
--source bigquery \
--project-id moz-fx-data-shared-prod
Updated•3 years ago