Bug 1253644 (Closed): Opened 8 years ago, Closed 8 years ago

Create derived Parquet dataset for KPIs

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

References

()

Details

User Story

We are going to need to compute our KPI metrics for different segments, where a segment is a combination of dimensions (channel, OS, e10s-enabled, ...).

I propose to write a Spark ETL job that emits, on a scheduled basis, a Parquet dataset with the following per-activity-date aggregates:

activity-date, DIM_1, ..., DIM_N, HLL

where HLL is the HyperLogLog sketch of the client IDs in that particular segment, i.e. a compact summary from which the (approximate) cardinality of that segment can be computed. The Parquet dataset could then be loaded into Presto. As HLL union is a monoid operation, it would be easy to determine the cardinality for a particular segment, e.g.:

select e10s_enabled, cardinality(merge(hll)) from table where channel='release'
group by e10s_enabled

Unfortunately, Spark and Presto don't support HLL in a cross-compatible way, but it should be possible to add support for it on our clusters through a custom extension.
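
For illustration, here is a minimal sketch of what such a Spark job could look like, assuming Algebird's HyperLogLogMonoid for the sketches and a hypothetical main_summary-style input with activity_date, channel, os, e10s_enabled and client_id columns. The column names, S3 paths and bit size are illustrative assumptions, not the actual implementation (see the telemetry-batch-view PR linked below for that):

    import com.twitter.algebird.{HLL, HyperLogLog, HyperLogLogMonoid}
    import org.apache.spark.sql.SparkSession

    object ClientCountSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("client-count-hll").getOrCreate()
        import spark.implicits._

        // 12 bits -> 2^12 registers, roughly 1.6% standard error on the count.
        val monoid = new HyperLogLogMonoid(bits = 12)

        // Hypothetical input: one row per (activity_date, channel, os, e10s_enabled, client_id).
        val pings = spark.read.parquet("s3://telemetry-example/main_summary/v1")

        val aggregates = pings
          .select("activity_date", "channel", "os", "e10s_enabled", "client_id")
          .rdd
          .map { row =>
            val key = (row.getString(0), row.getString(1), row.getString(2), row.getBoolean(3))
            // One-element HLL sketch for this client id.
            (key, monoid.create(row.getString(4).getBytes("UTF-8")))
          }
          // HLL union is the monoid plus, so per-segment sketches merge associatively.
          .reduceByKey(monoid.plus(_, _))
          .map { case ((date, channel, os, e10s), sketch) =>
            // Serialize the sketch so it can be merged further at query time.
            (date, channel, os, e10s, HyperLogLog.toBytes(sketch))
          }
          .toDF("activity_date", "channel", "os", "e10s_enabled", "hll")

        aggregates.write.mode("overwrite").parquet("s3://telemetry-example/client_count/v1")
      }
    }

Storing the serialized sketch per segment (rather than a plain count) is what makes the query-time rollups above possible: any combination of segments can be unioned and counted without going back to the raw client IDs.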
Assignee: nobody → rvitillo
Points: --- → 3
Priority: -- → P1
Blocks: 1255012
Blocks: 1251259
Depends on: 1257615
Blocks: 1256363
See URL for an example dashboard based on the HLL aggregates.

- https://github.com/mozilla/telemetry-batch-view/pull/41
- https://github.com/vitillo/presto-hyperloglog
Status: NEW → RESOLVED
Closed: 8 years ago
User Story: (updated)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard