Bug 1605442 Comment 5 Edit History

I've been notified by Google Cloud Support on case #21629086 that the Dataproc team has been investigating, with updates on 2019-12-23 and 2019-12-26. I'm expecting to hear back again on 2019-12-31. Case management through the console has been unavailable since 2019-12-20, so I have been communicating via email.

In order to improve the robustness of the service, I'll be wiring up an existing pathway that reads Avro dumps of the BigQuery `payload_bytes_decoded` tables instead of relying on the BigQuery Storage API to read directly from the service.
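
In Spark terms, the switch looks roughly like the following. This is a minimal sketch assuming PySpark with the spark-bigquery-connector and spark-avro packages on the classpath; the table name and dump layout are illustrative, not the job's actual code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current pathway: read a payload_bytes_decoded table directly over the
# BigQuery Storage API via the spark-bigquery-connector.
df = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shar-nonprod-efed.payload_bytes_decoded.telemetry_telemetry__main_v4")
    .load()
)

# Alternative pathway: read a pre-exported Avro dump of the same table from
# GCS, bypassing the Storage API entirely. The path layout is illustrative.
df = spark.read.format("avro").load(
    "gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed/"
    "telemetry_telemetry__main_v4/*.avro"
)
```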

I've tested the existing scripts with the following commands in the `python_mozaggregator` repo:

```bash
# Dump the payload_bytes_decoded tables to Avro on GCS
# (args: source project, staging dataset, GCS output prefix, submission date)
bin/export-avro.sh moz-fx-data-shar-nonprod-efed amiyaguchi-dev:avro_export gs://amiyaguchi-dev/avro-mozaggregator 2019-12-15

# Run the mobile aggregation job on Dataproc, reading the Avro dump
# instead of going through the BigQuery Storage API
NUM_WORKERS=5 bin/dataproc.sh \
	mobile \
	--output gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/ \
	--num-partitions 200 \
	--date 20191215 \
	--source avro \
	--avro-prefix gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed
```
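
For context on the first command: it presumably stages the tables in the `avro_export` dataset and extracts them to Avro files on GCS via google-cloud-sdk tooling (hence the Docker image change in the plan further down). The extract step amounts to roughly the following sketch using the google-cloud-bigquery client; the table name and destination layout are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client(project="amiyaguchi-dev")

# Extract a staged copy of a payload_bytes_decoded table to Avro on GCS;
# the destination mirrors the --avro-prefix passed to bin/dataproc.sh above.
job = client.extract_table(
    "amiyaguchi-dev.avro_export.telemetry_telemetry__main_v4",
    "gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed/"
    "telemetry_telemetry__main_v4/*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
job.result()  # block until the extract job completes
```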

Running these commands produces the following listing:
```
$ gsutil ls -r gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/:
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/_SUCCESS

gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/submission_date=20191215/:
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/submission_date=20191215/
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/submission_date=20191215/part-00000-e72b47cc-92e6-4620-a6e0-1844861aebee.c000.snappy.parquet
```
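
As a quick sanity check (illustrative; assumes pandas with pyarrow and gcsfs available in the environment), the output partition can be read back directly:

```python
import pandas as pd

# Read the single Parquet partition written by the test run.
df = pd.read_parquet(
    "gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/"
    "submission_date=20191215/"
)
print(df.shape)
print(df.head())
```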

I'm planning on taking the following route to implement reading from Avro:

* [mozaggregator] modify the published Docker image `mozilla/python_mozaggregator:latest` to include the google-cloud-sdk
* [mozaggregator] update `bin/export-avro.sh` to accept an argument for the specific table to export
* [airflow] add an upstream export job within `dags/mozaggregator_mobile.py` and update arguments to reflect the alternative processing pathway (a rough sketch follows this list)
* [airflow] add an upstream export job within `dags/mozaggregator_prerelease.py` and make the same argument updates
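
For the airflow items, the shape of the change is sketched below. This is a hypothetical skeleton using a plain BashOperator and the test-run paths from above; the real DAGs use their own operators, connections, and production buckets.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

dag = DAG(
    "mozaggregator_mobile",
    start_date=datetime(2019, 12, 1),
    schedule_interval="@daily",
)

# New upstream task: dump the payload_bytes_decoded tables to Avro on GCS.
export_avro = BashOperator(
    task_id="export_avro",
    bash_command=(
        "bin/export-avro.sh moz-fx-data-shar-nonprod-efed "
        "amiyaguchi-dev:avro_export "
        "gs://amiyaguchi-dev/avro-mozaggregator {{ ds }}"
    ),
    dag=dag,
)

# Existing aggregation job, now pointed at the Avro dump via --source avro.
run_mobile_aggregates = BashOperator(
    task_id="run_mobile_aggregates",
    bash_command=(
        "bin/dataproc.sh mobile "
        "--output gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/{{ ds_nodash }}/ "
        "--num-partitions 200 --date {{ ds_nodash }} --source avro "
        "--avro-prefix gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed"
    ),
    dag=dag,
)

export_avro >> run_mobile_aggregates
```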