Bug 1253644 (Closed): Opened 8 years ago, Closed 8 years ago

Create derived Parquet dataset for KPIs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

References

()

Details

User Story

We are going to need to compute our KPI metrics for different segments, where a segment is a combination of dimensions (channel, OS, e10s-enabled, ...).

I propose to write a Spark ETL job that emits, on a scheduled basis, a Parquet dataset with the following per-activity-date aggregates:

activity-date, DIM_1, ..., DIM_N, HLL

where HLL is the HyperLogLog sketch of the client IDs in that particular segment, i.e. a compact summary from which the (approximate) cardinality of that segment can be computed. The Parquet dataset could then be loaded into Presto. As HLL union is a monoid operation, it would be easy to determine the cardinality for a particular segment, e.g.:

select e10s_enabled, cardinality(merge(hll)) from table where channel='release'
group by e10s_enabled

Unfortunately, Spark and Presto don't support HLL in a cross-compatible way, but it should be possible to add support for it on our clusters through a custom extension.
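
For illustration, here is a minimal sketch of what such a Spark job could look like, assuming Algebird's HyperLogLogMonoid for the sketches and a hypothetical main_summary-style input with activity_date, channel, os, e10s_enabled and client_id columns. The column names, S3 paths and bit size are illustrative assumptions, not the actual implementation (see the telemetry-batch-view PR linked below for that):

    import com.twitter.algebird.{HLL, HyperLogLog, HyperLogLogMonoid}
    import org.apache.spark.sql.SparkSession

    object ClientCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("client-count-hll").getOrCreate()
        import spark.implicits._

        // 12 bits -> 2^12 registers, roughly 1.6% standard error on the count.
        val monoid = new HyperLogLogMonoid(bits = 12)

        // Hypothetical input: one row per (activity_date, channel, os, e10s_enabled, client_id).
        val pings = spark.read.parquet("s3://telemetry-example/main_summary/v1")

        val aggregates = pings
          .select("activity_date", "channel", "os", "e10s_enabled", "client_id")
          .rdd
          .map { row =>
            val key = (row.getString(0), row.getString(1), row.getString(2), row.getBoolean(3))
            // One-element HLL sketch for this client id.
            (key, monoid.create(row.getString(4).getBytes("UTF-8")))
          }
          // HLL union is the monoid plus, so per-segment sketches merge associatively.
          .reduceByKey(monoid.plus(_, _))
          .map { case ((date, channel, os, e10s), sketch) =>
            // Serialize the sketch so it can be merged further at query time.
            (date, channel, os, e10s, HyperLogLog.toBytes(sketch))
          }
          .toDF("activity_date", "channel", "os", "e10s_enabled", "hll")

        aggregates.write.mode("overwrite").parquet("s3://telemetry-example/client_count/v1")
      }
    }

Storing the serialized sketch per segment (rather than a plain count) is what makes the query-time rollups above possible: any combination of segments can be unioned and counted without going back to the raw client IDs.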
Assignee: nobody → rvitillo
Points: --- → 3
Priority: -- → P1
Blocks: 1255012
Blocks: 1251259
Depends on: 1257615
Blocks: 1256363
See URL for an example dashboard based on the HLL aggregates.

- https://github.com/mozilla/telemetry-batch-view/pull/41
- https://github.com/vitillo/presto-hyperloglog
Status: NEW → RESOLVED
Closed: 8 years ago
User Story: (updated)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard