Closed Bug 1331871 Opened 7 years ago Closed 7 years ago

[meta] Spark Streaming aggregator prototype

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: rvitillo, Unassigned)

References

Details

User Story

Foreword: this is an exploratory prototype; the technology stack and assumptions may vary over time.

Abstract:
We want to deploy a simple, yet fully functional, Spark aggregation application that creates summary statistics of a set of metrics, like crashes, usage hours & errors (e.g. GEOLOCATION_ERROR), for a given set of dimensions like channel, build-id, etc. The application should emit new data every N (5?) minutes, which will be visible to Presto/Athena and consequently to Redash.

The plan is to have the very same job run in both batch and streaming mode. The batch mode will run from Airflow at UTC midnight on the previous day's raw data and generate a batch view. The streaming view will keep generating data in perpetuity.
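To illustrate the "same code, two modes" idea in plain Python (the event shape, field names, and window slicing below are hypothetical stand-ins; the real job would be a Spark application):

```python
from collections import Counter

def aggregate(events):
    """Sum crash counts per (channel, build_id) dimension tuple.

    The same function serves both modes: batch feeds it a full day of
    events at once, streaming feeds it one 5-minute window at a time.
    """
    counts = Counter()
    for e in events:
        counts[(e["channel"], e["build_id"])] += e["crashes"]
    return dict(counts)

# Hypothetical day of events.
day = [
    {"channel": "nightly", "build_id": "20170118", "crashes": 2},
    {"channel": "nightly", "build_id": "20170118", "crashes": 1},
    {"channel": "beta", "build_id": "20170117", "crashes": 4},
]

# Batch mode: the whole day in one pass.
batch_view = aggregate(day)

# Streaming mode: the same function applied window by window.
stream_view = aggregate(day[:2])
```

Because the aggregation logic is a single function, the batch and streaming views cannot drift apart in how they compute the summaries; only the driver that feeds them differs.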

The batch mode comes in handy for several reasons, the most obvious being the generation of bigger Parquet files, which increases the compression ratio and reduces the amount of data Presto has to read at query time. Backfilling is another: stakeholders may want a new error to show up on the dashboards retroactively, for example.

The batch version will produce data on S3 under a different partition than the streaming version (e.g. s3://prefix/dataset/v2/mode=[batch, stream]/…). Presto/Athena could then query the union of the two views. The same definition of time has to be used in both the batch and streaming views.
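As a sketch of that layout, a hypothetical helper could build one day's prefix for either view (the bucket name, dataset name, and the `submission_date` partition column are illustrative assumptions, not the real layout):

```python
from datetime import date

def partition_prefix(bucket: str, dataset: str, version: int,
                     mode: str, day: date) -> str:
    """Build the S3 prefix for one day of one view (batch or stream)."""
    assert mode in ("batch", "stream")
    return (f"s3://{bucket}/{dataset}/v{version}/"
            f"mode={mode}/submission_date={day.isoformat()}")

# Presto/Athena would see `mode` as an ordinary partition column, so a
# single query over the table covers the union of the two views.
print(partition_prefix("prefix", "dataset", 2, "batch", date(2017, 1, 18)))
# → s3://prefix/dataset/v2/mode=batch/submission_date=2017-01-18
```

Keeping `mode` as just another partition also makes backfills cheap: a batch re-run only rewrites objects under `mode=batch/` and never touches the streaming output.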

Goals:
- produce prototypes of the error dashboard that we can show to stakeholders and iterate upon;
- evaluate the challenges of productionizing a Spark Streaming application;
- compare Spark Streaming to Hindsight in terms of performance; we expect Hindsight to be faster and cheaper but we would like to understand by how much;
- ensure that pushing a new version of the application to production requires only a PR to be merged in the repository [1];
- reduce operational effort to a minimum;
- use the same code for batch and streaming views;
- make [retroactive] backfills easy to run from Airflow.

[1] https://github.com/mozilla/telemetry-streaming
No description provided.
Blocks: 1251580
Depends on: 1283446
Depends on: 1331876
Depends on: 1331877
Depends on: 1331878
Depends on: 1331880
Depends on: 1331908
Priority: -- → P3
Depends on: 1332684
Depends on: 1332686
Depends on: 1335683
Depends on: 1337048
Depends on: 1337742
Depends on: 1337744
Depends on: 1337816
Depends on: 1338495
User Story: (updated)
We are moving project tracking to https://github.com/orgs/mozilla/projects/2.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard