Refactor MainSummary dataset generation code

Status: RESOLVED FIXED
Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: normal
Reported: 2 years ago
Last modified: 2 years ago

People

(Reporter: mreid, Assigned: mreid)

Tracking

Firefox Tracking Flags

(Not tracked)


Description (Assignee), 2 years ago
This will likely become the basis / reference for implementing other summary datasets, so we should make it into a good example of best practices.

That entails:
- Change from using an intermediate Avro representation to storing Spark DataFrames directly in Parquet form (see the sketch below).
- Stop using the deprecated "DerivedStream" approach and use the "views" approach instead.
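
For the first bullet, a minimal sketch of the write path (not the actual `MainSummaryView` code; the schema, object name, and output path are placeholders) might look like this with Spark's `SparkSession` API:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object MainSummaryWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MainSummaryWriteSketch")
      .getOrCreate()

    // Hypothetical, much-reduced schema; the real main_summary has many more columns.
    val schema = StructType(Seq(
      StructField("document_id", StringType, nullable = false),
      StructField("client_id", StringType, nullable = true),
      StructField("submission_date_s3", StringType, nullable = false)
    ))

    // In the real job these rows would come from parsing telemetry pings;
    // a single fabricated row is enough to show the write path.
    val rows = spark.sparkContext.parallelize(Seq(
      Row("doc-1", "client-1", "20160401")
    ))
    val df = spark.createDataFrame(rows, schema)

    // Write the DataFrame straight to Parquet, partitioned for per-day reads.
    // No intermediate Avro representation is involved.
    df.write
      .mode("overwrite")
      .partitionBy("submission_date_s3")
      .parquet("s3://example-bucket/main_summary/v3") // placeholder output location

    spark.stop()
  }
}
```

Because `submission_date_s3` is a partition column, downstream jobs that filter on it only need to read the matching day's files.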
Updated (Assignee), 2 years ago
Assignee: nobody → mreid
Points: --- → 3
Priority: -- → P1
Comment 1 (Assignee), 2 years ago
Work-in-progress PR here:
https://github.com/mozilla/telemetry-batch-view/pull/67
Comment 2 (Assignee), 2 years ago
An update on the status of that PR:
- Performance is known to be sub-optimal for now. The original `main_summary` code (before this change) took about 30 minutes to process a day's data. The code that kept the `DerivedStream` approach but switched from Avro to Spark SQL types took about 45 minutes. The current code using the `Views` approach takes about 90 minutes. There are plans to rewrite the S3-iterating code, so we will tackle the performance problem at that time.
- The `MainSummaryView` test coverage does not include testing of the data serialization, due to an incompatibility between versions of the `parquet-avro` library.
- The `submission_date_s3` field **is** still present in v3. We also introduce an S3 partition for `sample_id`, which is a string field but can be cast to a number efficiently when a sampling range is desired (see the sketch below).
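
To illustrate the last bullet, here is a rough read-side sketch (the path is a placeholder, not the real dataset location) that filters on the `submission_date_s3` partition and casts the string `sample_id` partition to an integer to select a 10% sample:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object MainSummaryReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MainSummaryReadSketch")
      .getOrCreate()

    // Placeholder path; point this at wherever the dataset was written.
    val df = spark.read.parquet("s3://example-bucket/main_summary/v3")

    // Both filters hit S3 partition columns, so only the matching
    // directories are listed and read (partition pruning).
    // Casting the string sample_id to an int gives a cheap 10% sample.
    val sampled = df
      .where(col("submission_date_s3") === "20160401")
      .where(col("sample_id").cast("int") < 10)

    println(s"Sampled rows: ${sampled.count()}")
    spark.stop()
  }
}
```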
Updated (Assignee), 2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED