Closed Bug 1475690 Opened 7 years ago Closed 6 years ago

Read pings as DataFrames in mozdata

Categories

(Data Platform and Tools :: General, defect, P2)

Points:
3

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: amiyaguchi, Unassigned, Mentored)

References

Details

(Whiteboard: [DataPlatform][MozData])

Moztelemetry is used by analysis and batch transformation jobs to access data from the ingestion pipeline.[1] The Dataset API currently returns unstructured RDDs because telemetry schemas have been missing server-side. MainSummaryView is an example of imposing structure on the raw data so it can be read as a DataFrame. Since the development of the API, several things have changed to support the more structured DataFrame:

* The creation of mozilla-pipeline-schemas (MPS) as the canonical source of telemetry ping schemas [2]
* Work toward direct-to-parquet (D2P) in the ingestion pipeline
* A `Dataset.toJValue` method for reconstructing extracted pings into their original structure [3]

These developments can be pieced together to provide structured access to raw data as a Spark DataFrame, and additionally as parquet. The Dataset API is a candidate location for this functionality, but the underlying schema transformation could also be used elsewhere. The API call may look something like the following:

> def toDataFrame(rdd: RDD[Message], schema: JValue): DataFrame

The JSON Schema can be sourced from MPS and transformed into the appropriate Spark SQL schema. There are a few important edge cases to take into account. One concrete example occurs in the main ping, where the nested `payload.histograms` field is defined as a map type. Luckily, parquet and Spark are expressive enough to handle types like this.

MPS and moztelemetry are both modeled around the existing ingestion infrastructure. Pings are batched and partitioned in a hierarchy defined by the ingestion specification.[4] The Dataset API is a wrapper that reads the data from the remote filesystem as an RDD. MPS uses a similar file hierarchy to map schemas to pings at ingestion. Schemas can be mapped to document types like they are in the edge-validator.[5]

Finally, there is some existing work in this space, but it is proof-of-concept quality. The first is a conversion of some toy datasets from JSON into DataFrames.[6] The second is an implementation for loading data into BigQuery that includes a large test suite.[7] If all goes well, it may be possible to generate automated "summary" datasets for existing pings in a two-pass approach. There are also likely to be performance and compression gains from serializing data as a column store.

[1] https://github.com/mozilla/moztelemetry
[2] https://github.com/mozilla-services/mozilla-pipeline-schemas
[3] Bug 1419116 - Reassemble heka-formatted records into a single submission structure via moztelemetry
[4] https://docs.telemetry.mozilla.org/concepts/pipeline/http_edge_spec.html
[5] https://github.com/mozilla-services/edge-validator
[6] Bug 1449636 - Convert synthetic main ping json-schema to spark-schema
[7] Bug 1442698 - Map main ping schema to big query schema and load data
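The JSON Schema to Spark SQL transformation, including the `payload.histograms` map-type edge case, can be sketched in Python. This is a hypothetical, simplified illustration, not the actual moztelemetry implementation: real code would build `pyspark.sql.types` objects, and the rule that an object with only `additionalProperties` becomes a map is an assumption drawn from the main-ping example.

```python
# Hypothetical sketch: convert a JSON Schema node into a Spark-SQL-style
# type string. Plain strings stand in for pyspark.sql.types objects to
# keep the example self-contained.

def jsonschema_to_spark(schema):
    """Map a JSON Schema node to a Spark-SQL-like type string."""
    primitive = {"string": "string", "integer": "long",
                 "number": "double", "boolean": "boolean"}
    t = schema.get("type")
    if t in primitive:
        return primitive[t]
    if t == "array":
        return "array<%s>" % jsonschema_to_spark(schema.get("items", {}))
    if t == "object":
        props = schema.get("properties")
        if props:  # fixed keys -> struct
            fields = ", ".join("%s: %s" % (name, jsonschema_to_spark(sub))
                               for name, sub in sorted(props.items()))
            return "struct<%s>" % fields
        extra = schema.get("additionalProperties")
        if isinstance(extra, dict):  # open keys -> map (the histograms case)
            return "map<string, %s>" % jsonschema_to_spark(extra)
    return "string"  # fall back to raw JSON for anything unrecognized

# The map-type edge case from the main ping: histograms keyed by name.
histograms = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "properties": {"sum": {"type": "integer"}},
    },
}
print(jsonschema_to_spark(histograms))
# -> map<string, struct<sum: long>>
```

A fuller version would also need to handle `oneOf`/`anyOf` unions and nullable fields, which JSON Schema expresses differently than Spark SQL does.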
Blocks: 1429634
Whiteboard: [DataPlatform]
Depends on: 1477413, 1478444, 1429902
Assignee: ecole → nobody
Points: --- → 3
Priority: P3 → P2
Summary: [meta] Read pings as DataFrames in moztelemetry → Read pings as DataFrames in mozdata
This can be added to mozdata by extending the API to consider individual pings as tables given a schema. This notebook demonstrates main pings as a DataFrame and their potential applications.[1] A list of things to resolve:

* How to keep the list of schemas up to date?
* What kinds of schema transformations are available?
* What kind of performance can be expected?

[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/38043/command/38046
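One way to keep the schema list up to date is to derive it from the same file hierarchy the edge-validator reads, mapping each schema file to a table key. A hypothetical sketch follows; the path layout mirrors the mozilla-pipeline-schemas convention of `schemas/<namespace>/<doctype>/<doctype>.<version>.schema.json`, and the function name is invented for illustration:

```python
import re

# Hypothetical helper: map an MPS-style schema path to the
# (namespace, doctype, version) triple that could name a ping table.
SCHEMA_PATH = re.compile(
    r"schemas/(?P<namespace>[^/]+)/(?P<doctype>[^/]+)/"
    r"(?P=doctype)\.(?P<version>\d+)\.schema\.json$")

def table_for_schema(path):
    """Return a (namespace, doctype, version) key, or None for non-schemas."""
    m = SCHEMA_PATH.search(path)
    if m is None:
        return None
    return (m.group("namespace"), m.group("doctype"), int(m.group("version")))

print(table_for_schema("schemas/telemetry/main/main.4.schema.json"))
# -> ('telemetry', 'main', 4)
```

Walking the MPS repository with a rule like this would let mozdata regenerate its table registry whenever schemas change, rather than maintaining a hand-curated list.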
Whiteboard: [DataPlatform] → [DataPlatform][MozData]
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Component: Telemetry APIs for Analysis → General