Closed
Bug 1253644
Opened 9 years ago
Closed 9 years ago
Create derived Parquet dataset for KPIs
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
Status: RESOLVED
Resolution: FIXED
People
(Reporter: rvitillo, Assigned: rvitillo)
Details
User Story
We are going to need to compute our KPI metrics for different segments, where a segment is a combination of dimensions (channel, OS, e10s-enabled, ...). I propose to write a Spark ETL job that emits, on a scheduled basis, a Parquet dataset with the following per-activity-date aggregates:

    activity-date, DIM_1, ..., DIM_N, HLL

where HLL is the HyperLogLog sketch of the clients in that particular segment, i.e. an approximate representation of that segment's cardinality. The Parquet dataset could then be loaded into Presto. As HLL is a monoid, it would be easy to determine the cardinality of a particular segment, e.g.:

    SELECT e10s_enabled, cardinality(union(hll))
    FROM table
    WHERE channel = 'release'
    GROUP BY e10s_enabled

Unfortunately, Spark and Presto don't support HLL in a cross-compatible way, but it should be possible to add support for it on our clusters through a custom extension.
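A minimal sketch of such a job, assuming Algebird's HyperLogLog implementation, a Spark 2-style API, and a hypothetical main_summary-like input with one row per client per activity date. The paths, dimension names, and 12-bit sketch size are all illustrative, not the actual implementation (see the pull request linked in comment 1 for that):

    import com.twitter.algebird.{HLL, HyperLogLog, HyperLogLogMonoid}
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    object ClientCountsETL {
      // 12 bits => 4096 registers, roughly 1.6% standard error (illustrative choice).
      val monoid = new HyperLogLogMonoid(bits = 12)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("client-counts").getOrCreate()

        // Hypothetical input: one row per (client, activity date) carrying the
        // segment dimensions.
        val rows = spark.read.parquet("s3://telemetry-parquet/main_summary/v4")
          .select("activity_date", "channel", "os", "e10s_enabled", "client_id")
          .rdd
          .map { r =>
            val key = (r.getString(0), r.getString(1), r.getString(2), r.getBoolean(3))
            // One single-element sketch per row; reduceByKey unions them per segment.
            key -> monoid.create(r.getString(4).getBytes("UTF-8"))
          }
          .reduceByKey(monoid.plus(_, _)) // HLL union is the monoid operation
          .map { case ((date, channel, os, e10s), sketch) =>
            Row(date, channel, os, e10s, HyperLogLog.toBytes(sketch))
          }

        val schema = StructType(List(
          StructField("activity_date", StringType),
          StructField("channel", StringType),
          StructField("os", StringType),
          StructField("e10s_enabled", BooleanType),
          StructField("hll", BinaryType)))

        // One Parquet row per (activity date, segment), ready to load into Presto.
        spark.createDataFrame(rows, schema)
          .write.mode("overwrite")
          .parquet("s3://telemetry-parquet/client_count/v1")

        spark.stop()
      }
    }

Storing the serialized sketch rather than a plain count is what makes the dataset re-aggregable: unioning the HLL columns across any subset of rows yields the distinct client count for the combined segment without double-counting clients that appear in several rows.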
No description provided.
Updated•9 years ago
Assignee: nobody → rvitillo
Updated•9 years ago
Points: --- → 3
Priority: -- → P1
Comment 1•9 years ago
See the URL field for an example dashboard based on the HLL aggregates; a Spark-side spot-check of those aggregates is sketched after the links below.
- https://github.com/mozilla/telemetry-batch-view/pull/41
- https://github.com/vitillo/presto-hyperloglog
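Since the sketches are just serialized bytes in a Parquet column, the same aggregates can also be spot-checked from the Spark side with Algebird, independently of the Presto extension. A hypothetical check, reusing the monoid and the output path from the sketch in the user story:

    import com.twitter.algebird.HyperLogLog

    // Merge the release-channel sketches for one day and estimate the count;
    // this should agree, within HLL error, with Presto's cardinality(union(hll)).
    val estimate = spark.read.parquet("s3://telemetry-parquet/client_count/v1")
      .filter("channel = 'release' AND activity_date = '2016-02-22'")
      .select("hll").rdd
      .map(r => HyperLogLog.fromBytes(r.getAs[Array[Byte]](0)))
      .reduce(monoid.plus(_, _))
      .approximateSize.estimate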
Status: NEW → RESOLVED
Closed: 9 years ago
User Story: (updated)
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard