Closed Bug 1442698 Opened 7 years ago Closed 7 years ago

Map main ping schema to BigQuery schema and load data

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

Attachments

(1 file)

There is existing work in this space.
Points: --- → 2
Priority: -- → P1
Script to convert JSON Schema to BigQuery schema, which drops under-specified fields: https://github.com/relud/telemetry-sample/blob/master/scripts/jsonschema_to_bigquery.py

The main_summary schema, and the BigQuery schema output by the above script: https://github.com/relud/telemetry-sample/tree/master/schemas
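For context, here is a minimal sketch of the kind of mapping such a transpiler performs; the function name and the exact drop rules are illustrative, not taken from the linked script.

```
# A minimal sketch (not the linked script) of mapping JSON Schema to
# BigQuery schema, including a rule that drops under-specified fields
# such as bare `additionalProperties` maps.

TYPE_MAP = {
    "string": "STRING",
    "integer": "INTEGER",
    "number": "FLOAT",
    "boolean": "BOOLEAN",
}

def jsonschema_to_bq(name, node, required=False):
    """Convert one JSON Schema node into a BigQuery field dict, or
    return None when the node is too under-specified to be a column."""
    mode = "REQUIRED" if required else "NULLABLE"
    jtype = node.get("type")
    if jtype == "object":
        props = node.get("properties")
        if not props:
            # A sparse map (only `additionalProperties`): there are no
            # concrete column names to emit, so the field is dropped.
            return None
        req = set(node.get("required", []))
        fields = []
        for key, sub in sorted(props.items()):
            field = jsonschema_to_bq(key, sub, key in req)
            if field is not None:
                fields.append(field)
        if not fields:
            return None
        return {"name": name, "type": "RECORD", "mode": mode, "fields": fields}
    if jtype == "array":
        item = jsonschema_to_bq(name, node.get("items", {}))
        if item is None:
            return None
        item["mode"] = "REPEATED"  # arrays become repeated fields
        return item
    if jtype in TYPE_MAP:
        return {"name": name, "type": TYPE_MAP[jtype], "mode": mode}
    return None  # unions, missing types, etc. are dropped in this sketch
```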
Depends on: 1446518
Priority: P1 → P2
Priority: P2 → P1
No longer depends on: 1446518
So it looks like the histogram fields will be tricky to import directly into BigQuery, since the schema builder is going to have to encode some main-ping-specific logic. The object types that are actually sparse maps in the JSON schema are going to be troublesome, for example histograms. This is because the BigQuery schema is more akin to a traditional database schema where all of the columns are declared up front, whereas the existing interfaces to the data (moztelemetry) assume a NoSQL or document-store approach to data access. Here, the JSON schema is a declarative description of the structure of the document as opposed to a true table schema.

Here's an example that clarifies what I mean. The following schema will accept an object that contains objects in the shape of `additionalProperties`.

```
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "additionalProperties": {
    "type": "object",
    "properties": {
      "shape": {"type": "string"},
      "color": {"type": "string"}
    },
    "required": ["shape", "color"]
  }
}
```

The following object will validate against this.

```
{
  "object_1": {"shape": "circle", "color": "blue"},
  "object_2": {"shape": "triangle", "color": "red"},
  "object_4": {"shape": "square", "color": "red"}
}
```

Assume we receive some number of these documents, where the object can contain any combination of keys that match the regex `object_[1-5]`. Despite this being a valid schema and a valid document, there is not quite enough information to generate the full schema that recovers all of the data we want. The JSON schema validator only cares whether the object's properties are all of the same shape; the number of properties is variable. This is useful when the columns are sparse and frequently updated or omitted from the document, which is the case for most of the telemetry measurements. In the main ping, the one object type of concern here is the histograms field, which is a sparse object that contains histograms with sparse value fields. Keyed histograms have an extra layer of this sparsity.

The solution here is to encode information about sparse fields into the schema builder. There are two techniques for getting this information:

1. sample the incoming data and recover the superset of keys in each sparse field
2. use a known list of all the keys

In the context of histograms in the main ping, MainSummaryView in `telemetry-batch-view` tackles this problem by generating the list of known histograms from `histograms.json` to convert the sparse map into a concrete column. However, it does not tackle sparsity in the histogram values, since parquet supports Map types in the schema. The approach here then is probably a mixture of the two techniques to recover field names, as well as extending the schema transpiler script to accept a list of concrete properties when `additionalProperties` is set to an object.

==========

As a proof of concept, the `payload.histograms` field should be populated directly from the schema. This transformation requires extra context to perform. This is my rough angle of attack (a sketch of step 2 follows below):

1. parse `histograms.json` for valid histograms; `com.mozilla.telemetry.metrics.Histogram` will probably be the point of reference
2. recover the bucket keys for each histogram by sampling main pings and taking the union of all keys seen so far [0]
3. pass this information to the json-bq schema transpiler as context
4. verify that the generated schema extracts the correct data from the document

[0] https://gist.github.com/acmiyaguchi/32e6c124927807786ad47c85d908fcca
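A rough sketch of the key-recovery step referenced above, assuming the sampled pings have already been decoded into Python dicts; `sampled_pings`, `union_of_keys`, and `concretize` are hypothetical names, not part of any existing tooling.

```
# Sketch of step 2: sample decoded main pings, take the union of keys
# observed in a sparse map, and turn those keys into concrete
# `properties` so the transpiler can emit real columns.

def union_of_keys(sampled_pings, path=("payload", "histograms")):
    """Collect the superset of keys seen at `path` across the sample."""
    keys = set()
    for ping in sampled_pings:
        node = ping
        for part in path:
            node = node.get(part, {})
        keys.update(node.keys())
    return keys

def concretize(schema_node, keys):
    """Replace an `additionalProperties` map with one concrete property
    per observed key, leaving the per-key value schema unchanged."""
    value_schema = schema_node.pop("additionalProperties", {})
    schema_node["properties"] = {key: value_schema for key in sorted(keys)}
    return schema_node
```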
Depends on: 1442699
BigQuery has functions for accessing JSON stored in string columns; maybe that could help.
That sounds like a good trade-off. It's probably still a good idea to break out each of the histograms into a separate column to reduce the total amount of deserialization. In that case, step 2 would be replaced with "serialize each histogram (or histogram.values) object as JSON". On the processing side, there could be a few UDFs that aggregate and summarize histograms given the object; see the sketch below. I still need to familiarize myself with the code, but it seems like there will need to be modifications to prevent histograms from being completely dropped by the script. In any case, I'd like to do the minimal amount of manual ETL for this entire process and let the BigQuery tooling do most of the work.
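A hedged sketch of what reading such a JSON string column could look like; the table `project.telemetry.main_v2` and the columns `document_id` and `gc_ms` are hypothetical, while `JSON_EXTRACT_SCALAR` is a standard BigQuery SQL function.

```
# Sketch of querying a histogram stored serialized as a JSON string.
# Table and column names are hypothetical placeholders.

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT
  document_id,
  -- JSON_EXTRACT_SCALAR returns NULL when the path is absent, which
  -- matches the sparsity of histogram buckets.
  JSON_EXTRACT_SCALAR(gc_ms, '$.sum') AS gc_ms_sum
FROM `project.telemetry.main_v2`
LIMIT 10
"""
for row in client.query(query).result():
    print(row.document_id, row.gc_ms_sum)
```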
Depends on: 1413381
Translating the JSON schema into a BigQuery schema has the same issues as translating it into a Spark schema; see the related bug 1449636.
See Also: → 1449636
Blocks: 1450135
This is complete; the data is available in object storage and ready to be loaded. I am running into annoying errors about fields having to start with a letter or underscore (despite being valid parquet column names). To be continued in a follow-up bug about performance on this small dataset.

```
Recent Jobs Errors:
query: Invalid field name "96". Fields must contain only letters, numbers,
and underscores, start with a letter or underscore, and be at most 128
characters long.
Table: main_v2_85ebc234_a2a0_4d52_b5ad_c0df932b7bab_source
(error code: invalidQuery)
```
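One possible workaround, not necessarily what the follow-up bug will do, is to sanitize column names before loading so that bucket keys like "96" become legal BigQuery field names; `sanitize_field_name` is a hypothetical helper.

```
# Rewrite a parquet column name to satisfy BigQuery's naming rules:
# letters, numbers, and underscores only; must start with a letter or
# underscore; at most 128 characters.

import re

def sanitize_field_name(name):
    """Make a parquet column name legal as a BigQuery field name."""
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name  # e.g. the bucket key "96" becomes "_96"
    return name[:128]

assert sanitize_field_name("96") == "_96"
```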
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General