Closed Bug 1557711 Opened 6 years ago Closed 6 years ago

clients_last_seen Parquet cannot be read by Spark

Categories

(Data Platform and Tools :: General, defect, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

Details

Attachments

(1 file)

The clients_last_seen table in BigQuery is exported to Parquet. In BQ, it's actually clients_last_seen_raw and we have a view on top to produce the days_since_* fields, but we have to materialize those when exporting to Parquet.

We made this switch on 2019-05-21. Dates prior to this had a BQ schema that included days_since_* and they were exported as INT64. Dates after this have the fields materialized as part of the export and they end up as INT32 (IntegerType according to Spark).

This all works fine for Presto and Athena, but Spark has trouble with the conversion. It can read recent days, but throws the following error when reading older days:

org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://telemetry-parquet/clients_last_seen/v1/submission_date=2019-05-13/part-00000-179a5e25-f1a1-4698-94d2-c63602d88462-c000.snappy.parquet. Column: [days_since_seen], Expected: IntegerType, Found: INT64

One possible solution here might be to manually change the column type in Hive to INT64; Spark may then be able to convert transparently without throwing an error.

Otherwise, we could change the export logic to produce INT64, but then we'd have to reprocess all previous days. This is likely expensive (it takes 3 hours of Dataproc cluster time per day of processing, plus data egress costs from GCP to AWS). We could attempt to reprocess the existing files on the AWS side instead, though.

The linked PR changes the export to produce INT64 (Long) fields. My expectation is that if we merge that, the Hive table will get updated to match the most recent day's data (INT64) and Spark may become happy with no further effort.

If Spark still isn't happy with the days where the fields are INT32, then I can run a one-off job in Databricks to convert them all to INT64 and I think we should be good.

New days are working well now, but the interim days with INT32-formatted fields still cause errors in Spark.

I am reprocessing those days now in the following notebook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/129169/command/129172

After that finishes, I will copy the results into the production directory, and we should be good to go.

I'm moving the new files into place now and have sent out an email notifying folks about the maintenance.

Copies have finished and counts look consistent across days. My tests in Spark make it look like I can query across the full date range without exceptions now.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: General → General
