Closed
Bug 1475690
Opened 7 years ago
Closed 6 years ago
Read pings as DataFrames in mozdata
Categories
(Data Platform and Tools :: General, defect, P2)
Data Platform and Tools
General
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: amiyaguchi, Unassigned, Mentored)
References
Details
(Whiteboard: [DataPlatform][MozData])
Moztelemetry is used by analysis and batch transformation jobs to access data from the ingestion pipeline.[1] The Dataset API currently returns unstructured RDDs because telemetry schemas have been missing server-side. MainSummaryView is an example of imposing structure on the raw data so it can be read as a DataFrame. Since the development of the API, several things have changed that support a more structured DataFrame:
* The creation of mozilla-pipeline-schemas (MPS) as the canonical source of telemetry pings [2]
* Work toward direct-to-parquet in the ingestion pipeline (D2P)
* A `Dataset.toJValue` method for reconstructing extracted pings into their original structure [3]
These developments can be pieced together to provide structured access to raw data as a Spark DataFrame and, additionally, as parquet. The Dataset API is a candidate location for this functionality, but the underlying schema transformation could also be used elsewhere.
The API call may look something like the following:
> def toDataFrame(rdd: RDD[Message], schema: JValue): DataFrame
The JSON Schema can be sourced from MPS and transformed into the appropriate Spark SQL schema. There are a few important edge cases to take into account. One concrete example occurs in the main ping, where the nested `payload.histograms` field is defined as a map type. Luckily, parquet and Spark are expressive enough to handle types like this.
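To illustrate the transformation, here is a minimal, hypothetical sketch of mapping a JSON Schema fragment onto Spark-SQL-style types (represented as plain dicts to stay dependency-free; this is not the mozdata implementation). The `payload.histograms` edge case shows up as an `object` with free-form `additionalProperties`, which maps naturally to a `map<string, ...>` rather than a struct:

```python
# Hypothetical JSON Schema -> Spark SQL type mapping, sketched with plain
# dicts instead of pyspark's StructType/MapType for portability.
JSON_TO_SQL = {"string": "string", "integer": "long",
               "number": "double", "boolean": "boolean"}

def to_spark_type(schema):
    t = schema.get("type")
    if t == "object":
        props = schema.get("properties")
        if props:
            # Fixed, known keys -> a struct with one field per property.
            return {"type": "struct",
                    "fields": [{"name": k, "type": to_spark_type(v)}
                               for k, v in sorted(props.items())]}
        extra = schema.get("additionalProperties")
        if isinstance(extra, dict):
            # Free-form keys (e.g. payload.histograms) -> map<string, valueType>.
            return {"type": "map", "keyType": "string",
                    "valueType": to_spark_type(extra)}
        return {"type": "map", "keyType": "string", "valueType": "string"}
    if t == "array":
        return {"type": "array",
                "elementType": to_spark_type(schema.get("items", {}))}
    return JSON_TO_SQL.get(t, "string")

# A payload.histograms-style fragment: histogram name -> histogram struct.
histograms = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "properties": {
            "sum": {"type": "integer"},
            "values": {"type": "object",
                       "additionalProperties": {"type": "integer"}},
        },
    },
}
print(to_spark_type(histograms)["type"])  # map
```

A real converter would also need to handle unions, nullability, and schema evolution, which is where most of the edge cases live.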
MPS and moztelemetry are both modeled around the existing ingestion infrastructure. Pings are batched and partitioned in a hierarchy defined by the ingestion specification.[4] The Dataset API is a wrapper that reads the data from the remote filesystem as an RDD. MPS uses a similar file hierarchy to map schemas to pings at ingestion time. Schemas can be mapped to document types the way they are in the edge-validator.[5]
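As a sketch of that mapping, the edge specification's submission URI layout (`/submit/<namespace>/<docType>/<docVersion>/<docId>`) can be translated into a schema path. The `schemas/<namespace>/<docType>/<docType>.<version>.schema.json` layout below is my reading of the MPS repository convention and should be verified against it:

```python
# Hypothetical mapping from an edge-spec submission URI to an MPS schema path.
# Assumes the schemas/<namespace>/<docType>/<docType>.<version>.schema.json
# layout; verify against mozilla-pipeline-schemas before relying on it.
def schema_path(submission_uri):
    parts = submission_uri.strip("/").split("/")
    # Expected: ["submit", namespace, docType, docVersion, docId]
    if len(parts) < 4 or parts[0] != "submit":
        raise ValueError("not a telemetry submission URI: %s" % submission_uri)
    namespace, doc_type, doc_version = parts[1], parts[2], parts[3]
    return "schemas/{0}/{1}/{1}.{2}.schema.json".format(
        namespace, doc_type, doc_version)

print(schema_path("/submit/telemetry/main/4/some-document-id"))
# schemas/telemetry/main/main.4.schema.json
```

This is essentially what the edge-validator does to pick a schema for each incoming document type.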
Finally, there is some existing work in this space, but it is proof-of-concept quality. The first is a conversion of some toy datasets from JSON into DataFrames.[6] The second is an implementation for loading data into BigQuery that includes a large test suite.[7]
If all goes well, it may be possible to generate automated "summary" datasets for existing pings in a two-pass approach. There are also likely to be performance and compression gains from serializing data as a column store.
[1] https://github.com/mozilla/moztelemetry
[2] https://github.com/mozilla-services/mozilla-pipeline-schemas
[3] Bug 1419116 - Reassemble heka-formatted records into a single submission structure via moztelemetry
[4] https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html
[5] https://github.com/mozilla-services/edge-validator
[6] Bug 1449636 - Convert synthetic main ping json-schema to spark-schema
[7] Bug 1442698 - Map main ping schema to big query schema and load data
Reporter · Updated 7 years ago
Assignee: ecole → nobody
Reporter · Updated 7 years ago
Points: --- → 3
Priority: P3 → P2
Summary: [meta] Read pings as DataFrames in moztelemetry → Read pings as DataFrames in mozdata
Reporter · Comment 1 · 7 years ago
This can be added to mozdata by extending the API to treat individual pings as tables, given a schema. This notebook demonstrates main pings as a DataFrame and their potential applications.[1]
A list of things to resolve:
* How to keep the list of schemas up to date?
* What kinds of schema transformations are available?
* What kind of performance can be expected?
[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/38043/command/38046
Updated 7 years ago
Whiteboard: [DataPlatform] → [DataPlatform][MozData]
Updated 6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Assignee · Updated 3 years ago
Component: Telemetry APIs for Analysis → General