Closed
Bug 1331871
Opened 7 years ago
Closed 7 years ago
[meta] Spark Streaming aggregator prototype
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: rvitillo, Unassigned)
References
Details
User Story
Foreword: this is an exploratory prototype; the technology stack and assumptions may change over time.

Abstract: We want to deploy a simple, yet fully functional, Spark aggregation application that creates summary statistics for a set of metrics, such as crashes, usage hours, and errors (e.g. GEOLOCATION_ERROR), broken down by a set of dimensions such as channel, build-id, etc. The application should emit new data every N (5?) minutes, which will be visible to Presto/Athena and consequently to redash.

The plan is to have the very same job run in both batch and streaming mode. The batch mode will run from Airflow at UTC midnight on the raw data of the previous day and generate a batch view. The streaming view will keep generating data in perpetuity.

The batch mode comes in handy for various reasons, the most obvious being the generation of bigger Parquet files, which increases the compression ratio and reduces the amount of data Presto has to read at query time. Backfilling is another reason: stakeholders may want a new error to show up on the dashboards retroactively, for example.

The batch version will produce data on S3 under a different partition than the streaming version (e.g. s3://prefix/dataset/v2/mode=[batch, stream]/…). Presto/Athena could then query the union of the two views. The same definition of time has to be used in both the batch and streaming views.
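The core of the aggregation described above can be sketched in plain Python (the real job would be a Spark application reading telemetry pings; the record fields and metric names here are illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical input records standing in for telemetry pings.
pings = [
    {"channel": "release", "build_id": "20170101", "crashes": 1, "usage_hours": 2.0},
    {"channel": "release", "build_id": "20170101", "crashes": 0, "usage_hours": 3.5},
    {"channel": "beta",    "build_id": "20170102", "crashes": 2, "usage_hours": 1.0},
]

def aggregate(records):
    """Sum each metric per (channel, build_id) dimension tuple."""
    totals = defaultdict(lambda: {"crashes": 0, "usage_hours": 0.0})
    for r in records:
        key = (r["channel"], r["build_id"])
        totals[key]["crashes"] += r["crashes"]
        totals[key]["usage_hours"] += r["usage_hours"]
    return dict(totals)

summary = aggregate(pings)
# The same aggregation logic could back both the batch and the streaming job;
# only the input source and the output partition (mode=batch vs mode=stream)
# would differ.
print(summary[("release", "20170101")])
```

In Spark this would be a groupBy over the dimension columns with sum aggregations; the point of the sketch is that a single aggregation definition can serve both views.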
Goals:
- produce prototypes of the error dashboard that we can show to stakeholders and iterate upon;
- evaluate the challenges of productionizing a Spark Streaming application;
- compare Spark Streaming to Hindsight in terms of performance; we expect Hindsight to be faster and cheaper, but we would like to understand by how much;
- ensure that pushing a new version of the application to production only requires a PR to be merged in the repository [1];
- reduce operational effort to a minimum;
- use the same code for the batch and streaming views;
- make [retroactive] backfills easy to do from Airflow.

[1] https://github.com/mozilla/telemetry-streaming
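The batch/streaming partition split described in the user story can be sketched as a small path helper (the dataset name and the date partition below are hypothetical; the bug only specifies the prefix/dataset/v2/mode=[batch, stream]/… layout):

```python
def partition_path(prefix, dataset, mode, date):
    """Build the S3 partition an output writer would target.

    mode=batch holds the Airflow-generated daily view, mode=stream the
    perpetually updating streaming view; Presto/Athena would query the
    union of the two. The submission_date partition is an assumption.
    """
    if mode not in ("batch", "stream"):
        raise ValueError("mode must be 'batch' or 'stream'")
    return f"s3://{prefix}/{dataset}/v2/mode={mode}/submission_date={date}"

print(partition_path("prefix", "error_aggregates", "batch", "20170101"))
```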
No description provided.
Updated•7 years ago
Priority: -- → P3
Updated•7 years ago
User Story: (updated)
Comment 1•7 years ago
We are moving project tracking to https://github.com/orgs/mozilla/projects/2.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID
Updated•6 years ago
|
Product: Cloud Services → Cloud Services Graveyard