clients_last_seen Parquet cannot be read by Spark
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: klukas)
Details
Attachments
(1 file)
The clients_last_seen table in BigQuery is exported to Parquet. In BQ, the underlying table is actually clients_last_seen_raw with a view on top that produces the days_since_* fields, so we have to materialize those fields when exporting to Parquet.
We made this switch on 2019-05-21. Dates prior to this had a BQ schema that included the days_since_* fields, which were exported as INT64. Dates after this have the fields materialized as part of the export, and they end up as INT32 (IntegerType in Spark).
This all works fine for Presto and Athena, but Spark has trouble with the conversion. It can read recent days, but throws the following error if reading older days:
org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3://telemetry-parquet/clients_last_seen/v1/submission_date=2019-05-13/part-00000-179a5e25-f1a1-4698-94d2-c63602d88462-c000.snappy.parquet. Column: [days_since_seen], Expected: IntegerType, Found: INT64
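The failure mode can be summarized as an asymmetric compatibility rule: a 64-bit table column can absorb 32-bit file values, but not the reverse. The sketch below is a simplified stdlib-only model of that rule, not Spark's actual conversion code:

```python
# Simplified model of Parquet-to-table type reconciliation: a read is
# allowed when the types match, or when the file type can be widened
# losslessly into the table type. Illustrative only; not Spark's code.
WIDENINGS = {("INT32", "INT64")}  # (file type, table type) upcasts

def can_read(file_type: str, table_type: str) -> bool:
    return file_type == table_type or (file_type, table_type) in WIDENINGS

# A LongType (INT64) table column can read both old and new files:
assert can_read("INT64", "INT64") and can_read("INT32", "INT64")
# But an IntegerType (INT32) column cannot read the older INT64 files,
# which matches the error above: Expected IntegerType, Found INT64.
assert not can_read("INT64", "INT32")
```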
Comment 1•6 years ago
One possible solution here might be to manually change the column type in Hive to INT64; Spark may then be able to convert transparently without throwing an error.
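A sketch of that manual change, assuming a Hive table named clients_last_seen and the days_since_seen column (the actual table name and full column list may differ); BIGINT is Hive's 64-bit integer type:

```sql
-- Hypothetical Hive DDL: widen the declared column type so Spark
-- reads both INT32 and INT64 Parquet files through a LongType schema.
ALTER TABLE clients_last_seen
  CHANGE COLUMN days_since_seen days_since_seen BIGINT;
```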
Otherwise, we could change the export logic to produce INT64, but then we'd have to reprocess all previous days, which is likely expensive (it takes 3 hours of Dataproc cluster time per day of processing, plus data egress costs from GCP to AWS). Alternatively, we could reprocess the existing files on the AWS side.
Comment 2•6 years ago
Comment 3•6 years ago
The linked PR changes the export to produce INT64 (Long) fields. My expectation is that if we merge it, the Hive table will get updated to match the most recent day's data (INT64), and Spark may become happy with no further effort.
If Spark still isn't happy with the days where the fields are INT32, I can run a one-off job in Databricks to convert them all to INT64, and I think we should then be good.
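At the value level, the one-off conversion only needs to cast the days_since_* columns, and widening INT32 to INT64 is lossless since every 32-bit signed integer fits in 64 bits. A stdlib-only sketch of that round trip (an illustration, not the actual Databricks notebook code):

```python
import struct

def widen_int32_to_int64(values):
    """Round-trip int32 values through an int64 encoding; the widening
    is lossless, so the original values come back unchanged."""
    packed32 = struct.pack(f"<{len(values)}i", *values)        # 4 bytes each
    unpacked = struct.unpack(f"<{len(packed32) // 4}i", packed32)
    packed64 = struct.pack(f"<{len(unpacked)}q", *unpacked)    # 8 bytes each
    return list(struct.unpack(f"<{len(packed64) // 8}q", packed64))

# Full int32 range survives the widening intact.
days_since_seen = [0, 1, 27, -2147483648, 2147483647]
assert widen_int32_to_int64(days_since_seen) == days_since_seen
```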
Comment 4•6 years ago
New days are working well now, but the interim days with INT32-formatted fields still cause errors in Spark.
I am reprocessing those days now in the following notebook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/129169/command/129172
After that finishes, I will copy the results into the production directory, and we should be good to go.
Comment 5•6 years ago
I'm moving the new files into place now, and have sent out an email notifying folks about the maintenance.
Comment 6•6 years ago
Copies have finished and counts look consistent across days. My tests in Spark indicate that the full date range can now be queried without exceptions.
Updated•3 years ago