This will likely become the basis / reference for implementing other summary datasets, so we should make it into a good example of best practices. That entails:

- Change from using an intermediate Avro representation to directly storing Spark DataFrames in Parquet form.
- Stop using the deprecated "DerivedStream" approach and use the "views" approach instead.
Work-in-progress PR here: https://github.com/mozilla/telemetry-batch-view/pull/67
An update on the status of that PR:

- Performance is known to be sub-optimal for now. The original `main_summary` code (before this change) took about 30 minutes to process a day's data. The code that kept the `DerivedStreams` approach but switched from Avro to SparkSQL types took about 45 minutes. The current code using the `Views` approach takes about 90 minutes. There are plans to rewrite the S3-iterating code, so we will tackle the performance problem at that time.
- The `MainSummaryView` test coverage does not include the data serialization, due to an incompatibility between versions of the `parquet-avro` library.
- The `submission_date_s3` field **is** still present in v3. We also introduce an S3 partition on `sample_id`, which is a string field but can be cheaply cast to a number when sampling ranges are desired.
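Since `sample_id` is stored as a string partition value but is numeric in content, selecting a sampling range means casting it. A minimal sketch of that idea in plain Python (the helper name `in_sample` is illustrative, not part of the PR):

```python
def in_sample(sample_id: str, lo: int, hi: int) -> bool:
    """True if the string-typed sample_id falls in the half-open range [lo, hi)."""
    return lo <= int(sample_id) < hi

# sample_id partitions range over "0".."99", so a 1% sample is a single partition.
partitions = [str(i) for i in range(100)]
one_percent = [s for s in partitions if in_sample(s, 0, 1)]
print(one_percent)  # → ['0']
```

In Spark SQL the equivalent filter would be something like `WHERE cast(sample_id AS INT) < 1`; whether partition pruning survives the cast depends on the Spark version, so filtering on the string values directly may be safer.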