Closed Bug 1331871 Opened 7 years ago Closed 7 years ago

[meta] Spark Streaming aggregator prototype

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: rvitillo, Unassigned)

References

Details

User Story

Foreword: this is an exploratory prototype; the technology stack and assumptions may vary over time.

Abstract:
We want to deploy a simple, yet fully functional, Spark aggregation application that creates summary statistics of a set of metrics, like crashes, usage hours & errors (e.g. GEOLOCATION_ERROR), for a given set of dimensions like channel, build-id, etc. The application should emit new data every N (5?) minutes, which will be visible to Presto/Athena and consequently to Redash.

The plan is to have the very same job run in both batch and streaming mode. The batch mode will run from Airflow at UTC midnight on the previous day's raw data and generate a batch view. The streaming view will keep generating data in perpetuity.
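To illustrate the "same code, two modes" idea in plain Python (the event shape, field names, and window slicing below are hypothetical stand-ins; the real job would be a Spark application):

```python
from collections import Counter

def aggregate(events):
    """Sum crash counts per (channel, build_id) dimension tuple.

    The same function serves both modes: batch feeds it a full day of
    events at once, streaming feeds it one 5-minute window at a time.
    """
    counts = Counter()
    for e in events:
        counts[(e["channel"], e["build_id"])] += e["crashes"]
    return dict(counts)

# Hypothetical day of events.
day = [
    {"channel": "nightly", "build_id": "20170118", "crashes": 2},
    {"channel": "nightly", "build_id": "20170118", "crashes": 1},
    {"channel": "beta", "build_id": "20170117", "crashes": 4},
]

# Batch mode: the whole day in one pass.
batch_view = aggregate(day)

# Streaming mode: the same function applied window by window.
stream_view = aggregate(day[:2])
```

Because the aggregation logic is a single function, the batch and streaming views cannot drift apart in how they compute the summaries; only the driver that feeds them differs.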

The batch mode comes in handy for several reasons, the most obvious being the generation of bigger Parquet files, which increases the compression ratio and reduces the amount of data Presto has to read at query time. Backfilling is another: stakeholders may want a new error to show up on the dashboards retroactively, for example.

The batch version will produce data on S3 under a different partition than the streaming version (e.g. s3://prefix/dataset/v2/mode=[batch, stream]/…). Presto/Athena could then query the union of the two views. The same definition of time has to be used in both the batch and streaming views.
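As a sketch of that layout, a hypothetical helper could build one day's prefix for either view (the bucket name, dataset name, and the `submission_date` partition column are illustrative assumptions, not the real layout):

```python
from datetime import date

def partition_prefix(bucket: str, dataset: str, version: int,
                     mode: str, day: date) -> str:
    """Build the S3 prefix for one day of one view (batch or stream)."""
    assert mode in ("batch", "stream")
    return (f"s3://{bucket}/{dataset}/v{version}/"
            f"mode={mode}/submission_date={day.isoformat()}")

# Presto/Athena would see `mode` as an ordinary partition column, so a
# single query over the table covers the union of the two views.
print(partition_prefix("prefix", "dataset", 2, "batch", date(2017, 1, 18)))
# → s3://prefix/dataset/v2/mode=batch/submission_date=2017-01-18
```

Keeping `mode` as just another partition also makes backfills cheap: a batch re-run only rewrites objects under `mode=batch/` and never touches the streaming output.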

Goals:
- produce prototypes of the error dashboard that we can show to stakeholders and iterate upon;
- evaluate the challenges of productionizing a Spark Streaming application;
- compare Spark Streaming to Hindsight in terms of performance; we expect Hindsight to be faster and cheaper but we would like to understand by how much;
- ensure that pushing a new version of the application to production requires only a PR to be merged in the repository [1];
- reduce operational effort to a minimum;
- use the same code for batch and streaming views;
- make [retroactive] backfills easy to run from Airflow.

[1] https://github.com/mozilla/telemetry-streaming
No description provided.
Blocks: 1251580
Depends on: 1283446
Depends on: 1331876
Depends on: 1331877
Depends on: 1331878
Depends on: 1331880
Depends on: 1331908
Priority: -- → P3
Depends on: 1332684
Depends on: 1332686
Depends on: 1335683
Depends on: 1337048
Depends on: 1337742
Depends on: 1337744
Depends on: 1337816
Depends on: 1338495
User Story: (updated)
We are moving project tracking to https://github.com/orgs/mozilla/projects/2.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard