Closed Bug 1605442 Opened 2 months ago Closed 1 month ago

mozaggregator missing prerelease data since 2019-12-18

Categories

(Data Platform and Tools :: Datasets: Telemetry Aggregates, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

The prerelease_telemetry_aggregates job has been failing since 2019-12-18 due to the following error:

Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error
	at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
	... 24 more

A full stack trace can be found here. The failure originates in the spark-bigquery connector, before a session to BigQuery can be established.

I can reproduce this within a sandbox account using the development tooling in the mozilla/python_mozaggregator repository.

bin/dataproc.sh \
	parquet \
	--output gs://amiyaguchi-dev/mozaggregator/parquet_test/bq_nonprod/20191111 \
	--num-partitions 200 \
	--date 20191111 \
	--channels nightly \
	--source bigquery \
	--project-id moz-fx-data-shar-nonprod-efed

The job code and Airflow DAG for this job have remained largely unchanged. There was one PR to python_mozaggregator 4 days ago to cache a dataset, but that code runs after a successful BigQuery read.

Some other notable events that happened this week include the 0.11.0-beta release of the spark-bigquery connector on 2019-12-17, and the deployment of the Kubernetes-based BigQuery sink and retention policy for the payload_bytes_decoded table in bug 1595132.

I've opened a ticket with Google Cloud Support (Case#21629086).

In a new Dataproc cluster in the amiyaguchi-dev sandbox project, with the connector jar included from gs://spark-lib/bigquery/spark-bigquery-latest.jar, I ran the following command inside the pyspark interpreter.

df = spark.read.format("bigquery").option(
    "table", "moz-fx-data-shar-nonprod-efed.payload_bytes_decoded.telemetry_telemetry__mobile_metrics_v1"
).option("filter", "submission_timestamp >= '2019-12-17' AND submission_timestamp < '2019-12-18'").load()
df.count()

This fails with the same error message as the python_mozaggregator failure:

py4j.protocol.Py4JJavaError: An error occurred while calling o113.count.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.InternalException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error
        at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:67)
         ....
                at py4j.GatewayConnection.run(GatewayConnection.java:238)
                ... 1 more
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error
        at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)

In the BigQuery console, I am able to run the following query within the airflow-dataproc-prod project:

select count(*)
from `moz-fx-data-shar-nonprod-efed`.payload_bytes_decoded.telemetry_telemetry__mobile_metrics_v1
where date(submission_timestamp) = date '2019-12-18'
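
Since the connector fails before a read session is established while the same query succeeds in the console, a useful cross-check is to create a read session directly with the BigQuery Storage API Python client, bypassing the connector entirely. This is only a rough sketch using the v1 client; the billing project below is a placeholder.

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

# Create a read session directly against the BigQuery Storage API. If this
# also fails with INTERNAL, the problem is server-side rather than in the
# spark-bigquery connector.
client = bigquery_storage_v1.BigQueryReadClient()
table = (
    "projects/moz-fx-data-shar-nonprod-efed"
    "/datasets/payload_bytes_decoded"
    "/tables/telemetry_telemetry__mobile_metrics_v1"
)
requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        row_restriction=(
            "submission_timestamp >= '2019-12-17' "
            "AND submission_timestamp < '2019-12-18'"
        )
    ),
)
session = client.create_read_session(
    parent="projects/amiyaguchi-dev",  # placeholder billing project
    read_session=requested_session,
    max_stream_count=1,
)
print(session.name)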
Assignee: nobody → amiyaguchi

Based on a bit more investigation into the cause, it seems likely that recent changes to the payload_bytes_decoded table are causing issues with the mozaggregator job. Running the following Python snippets in a pyspark session on a single-node Dataproc cluster reveals a clear difference between sources.

payload_bytes_decoded

spark.read.format("bigquery").option(
    "table", "moz-fx-data-shar-nonprod-efed.payload_bytes_decoded.telemetry_telemetry__mobile_metrics_v1"
).option("filter", "submission_timestamp >= '2019-12-17' AND submission_timestamp < '2019-12-18'").load().count()

Leads to immediate failure with "Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INTERNAL: request failed: internal error".

telemetry_live

spark.read.format("bigquery").option(
    "table", "moz-fx-data-shar-nonprod-efed.telemetry_live.mobile_metrics_v1"
).option("filter", "submission_timestamp >= '2019-12-17' AND submission_timestamp < '2019-12-18'").load().count()

Leads to a successful load [Stage 0:> (0 + 20) / 183]

There is probably a permission issue that causes the job to fail immediately but is not being propagated up by the spark-bigquery connector. It's unclear what effect the recent changes to data expiration policies on the payload_bytes_decoded dataset have on access via the spark-bigquery connector.

:amiyaguchi and I did some fairly rigorous testing of permissions and determined that permissions are not likely to be the issue. We tested payload_bytes_raw, decoded, and error, and all failed the same way (once his worker instance was granted access to those datasets in stage). We added a temporary dataset to payload_bytes_decoded in stage with no data in it, and access to that did not exhibit the issue.

At this point the best hypothesis is that the issue is on Google's end, and we should continue to track it in the support case. An additional hypothesis is that the major change to these tables, namely that they are now populated by the new streaming API instead of batch loads, is a contributing factor to something failing on the backend.
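
As a quick way to poke at both hypotheses outside of the connector, table metadata can be fetched with the plain BigQuery client. This is only a sketch, not one of the tests described above: a successful get_table() rules out metadata-level access problems, and a non-empty streaming_buffer would confirm the table is now fed by streaming inserts rather than batch loads.

from google.cloud import bigquery

client = bigquery.Client(project="amiyaguchi-dev")  # placeholder billing project
table = client.get_table(
    "moz-fx-data-shar-nonprod-efed.payload_bytes_decoded"
    ".telemetry_telemetry__mobile_metrics_v1"
)
# num_rows/num_bytes come from table metadata; streaming_buffer is None when
# the table has no active streaming buffer.
print(table.num_rows, table.num_bytes)
print(table.streaming_buffer)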

I've been notified by Google Cloud Support on case #21629086 that the Dataproc team has been investigating, with updates on 2019-12-23 and 2019-12-26. I'm expecting to hear back again on 2019-12-31. Case management through the console has been unavailable since 2019-12-20, so I have been communicating via email.

In order to improve the robustness of the service, I'll be switching to an existing pathway that reads Avro dumps of the BigQuery payload_bytes_decoded tables instead of relying on the Storage API to read directly from the service.
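
On the read side, the Avro pathway amounts to pointing Spark at the exported files rather than at the bigquery source. Roughly, in a pyspark session with the spark-avro package available (the directory layout under the prefix here is an assumption for illustration, not taken from the job code):

# Sketch of the Avro read side; the layout under the prefix is assumed.
avro_prefix = "gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed"
table_path = "telemetry_live/mobile_metrics_v1"  # hypothetical table path
date = "2019-12-15"

df = spark.read.format("avro").load(
    "{}/{}/{}/*.avro".format(avro_prefix, table_path, date)
)
df.count()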

I've tested the existing scripts with the following commands in the python_mozaggregator repo:

bin/export-avro.sh moz-fx-data-shar-nonprod-efed amiyaguchi-dev:avro_export gs://amiyaguchi-dev/avro-mozaggregator 2019-12-15

NUM_WORKERS=5 bin/dataproc.sh \
	mobile \
	--output gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/ \
	--num-partitions 200 \
	--date 20191215 \
	--source avro \
	--avro-prefix gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed

This results in the following listing:

$ gsutil ls -r gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/:
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/_SUCCESS

gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/submission_date=20191215/:
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/submission_date=20191215/
gs://amiyaguchi-dev/mozaggregator/mobile_test/nonprod/20191215/submission_date=20191215/part-00000-e72b47cc-92e6-4620-a6e0-1844861aebee.c000.snappy.parquet
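
For context, the core of the export step can be approximated with a BigQuery extract job that writes a table out to GCS as Avro. This is a sketch only; bin/export-avro.sh may differ (for example by first staging one day of data into the scratch dataset passed as its second argument), and the staged table name below is hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="amiyaguchi-dev")
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO
)
extract_job = client.extract_table(
    # Hypothetical staged table holding one day of data.
    "amiyaguchi-dev.avro_export.mobile_metrics_v1_20191215",
    "gs://amiyaguchi-dev/avro-mozaggregator/moz-fx-data-shar-nonprod-efed/"
    "mobile_metrics_v1/2019-12-15/*.avro",
    job_config=job_config,
)
extract_job.result()  # block until the export completes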

I'm planning on taking the following route to implement reading from avro:

  • [mozaggregator] modifying the published docker image mozilla/python_mozaggregator:latest to include the google-cloud-sdk
  • [mozaggregator] updating bin/export-avro.sh to accept an argument for the specific table to export
  • [airflow] adding an upstream export job within dags/mozaggregator_mobile.py and updating arguments to reflect the alternative processing pathway
  • [airflow] adding an upstream export job within dags/mozaggregator_prerelease.py and updating arguments to reflect the alternative processing pathway (a rough sketch of such a task follows this list)
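
Here is a rough sketch of what such an upstream export task could look like in the prerelease DAG. Everything in it is a placeholder (DAG id, task ids, project, dataset, and bucket), and the real DAG may use a GKE or Dataproc operator rather than BashOperator; it is meant only to show the intended dependency shape.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Placeholder DAG standing in for the existing mozaggregator_prerelease DAG.
dag = DAG(
    "mozaggregator_prerelease_sketch",
    schedule_interval="@daily",
    start_date=datetime(2019, 12, 18),
    default_args={"retries": 2, "retry_delay": timedelta(minutes=30)},
)

# Upstream task that dumps one day of data to Avro before aggregation runs.
export_avro = BashOperator(
    task_id="prerelease_telemetry_aggregates_avro_export",
    bash_command=(
        "bin/export-avro.sh some-source-project some-project:avro_export "
        "gs://some-bucket/avro-mozaggregator {{ ds }}"
    ),
    dag=dag,
)

# Stand-in for the existing aggregation task; in the real DAG this submits the
# Dataproc job that runs python_mozaggregator with --source avro.
prerelease_aggregates = BashOperator(
    task_id="prerelease_telemetry_aggregates",
    bash_command="echo 'placeholder for the dataproc job'",
    dag=dag,
)

export_avro >> prerelease_aggregates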
Depends on: 1606361

The issue has been traced to an internal configuration error on Google's side that requires a server-side fix. The earliest possible date for the fix is Monday, 2020-01-13. A public issue tracker has been created at https://issuetracker.google.com/issues/147113808.

As per bug 1606361, the mobile aggregates parquet job has been modified to read from Avro, but it is constrained by BigQuery's 10TB-per-day export limit. Extending this to the prerelease aggregates will limit backfill to 4 days of data per day, since each Avro dump is ~2TB.

Prerelease aggregates are being backfilled now; the ETA is approximately 72 hours (4 hours per day for 18 days) until the database is caught up.

An update on the backfill, since it has been almost 3 days since the last comment.

The first backfill job for prerelease aggregates started at 2020-01-07 00:58:33 UTC for 2019-12-18. There was a small hiccup with backfill, as the scheduler skipped from 2019-12-23 to 2020-01-06. The following command was run at approximately 2020-01-08 17:41 to resume backfill from the 24th onward:

airflow backfill --start_date 2019-12-24 --end_date 2020-01-05 prerelease_telemetry_aggregates 

There are 6 days of backfill left starting from 2020-01-02, with an average of 3.5 hours per day.

Depends on: 1609546

Backfill is now complete, and data should be up to date. The underlying issue with the BigQuery Storage API connector and the payload_bytes tables is still unresolved; however, this job has been modified to read from exported Avro data.

I will update this bug with a few comments in the following days/weeks covering a timeline of events and a postmortem.

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED