Closed Bug 1311796 Opened 8 years ago Closed 8 years ago

Create TxP Pings Dataset

Categories

(Data Platform and Tools :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bugzilla, Assigned: bugzilla)

- Investigate feasibility of creating a dataset that would be sufficient to run all the TxP dashboard queries
- Create said dataset if possible
Blocks: 1255755
No longer depends on: 1255755
Creating a dataset for 'testpilot' type pings is fairly straightforward using the ping format documented here: https://github.com/mozilla/testpilot/blob/master/docs/metrics/telemetry.md#testpilot-summary-ping

Creating a dataset for 'testpilottest' pings is a bit hairier, since every Test Pilot test will have its own format, and specifying the schema for each new test and each format change would be a maintenance headache (envelope format documented here: https://github.com/mozilla/testpilot/blob/master/docs/metrics/telemetry.md#per-experiment-testpilottest-ping).

I'd like to experiment with generating a schema programmatically: group the testpilottest pings by test, read the distinct fields in all the 'payload/payload' objects, and add those to the standard fields we'll want in the dataset, like client_id, submission_date, etc.

A couple of concerns here would be:
- Schema changes (might be solved by adding a version field to the 'testpilottest' envelope?)
- Nested objects within the inner payload (doesn't seem to be an issue yet with the testpilottest formats I've seen, so perhaps this is a concern we can defer until it presents an issue)
- Types: the python version of createDataFrame will infer types for columns, but it's looking like the spark version will take a little more work.

I do think that if this works, this pattern will be useful for other ping types in the future as well. Any objections or other concerns before I dive into implementation?
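For illustration, the grouping step proposed above might look roughly like the following. This is only a sketch in plain Python, not an implementation from this bug: it assumes each ping is a dict whose 'payload/payload' object is a flat dict, and that the envelope carries the test identifier under a "test" key. That key name and the helper name infer_schemas are assumptions made for the example.

```python
from collections import defaultdict

# Standard columns every per-test dataset would carry (names from the comment above).
STANDARD_FIELDS = ["client_id", "submission_date"]


def infer_schemas(pings):
    """Group pings by test and collect the distinct inner-payload field names.

    Assumes each ping looks like {"payload": {"test": <id>, "payload": {...}}}.
    The "test" key is an assumption, not a confirmed envelope field.
    """
    fields_by_test = defaultdict(set)
    for ping in pings:
        envelope = ping.get("payload", {})
        test = envelope.get("test")
        inner = envelope.get("payload")
        if test is None or not isinstance(inner, dict):
            continue  # skip pings that do not match the assumed shape
        fields_by_test[test].update(inner.keys())

    # Standard fields first, then the test-specific fields in a stable order.
    return {
        test: STANDARD_FIELDS + sorted(f for f in fields if f not in STANDARD_FIELDS)
        for test, fields in fields_by_test.items()
    }
```

This only produces column names. It does not address the type inference or nested-object concerns listed above, and a format change within one test would simply widen that test's column list.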
As discussed on Monday, you should make sure that this is still worth doing, considering that we are adding a Parquet sink for Hindsight.
Given that Trink expects the sink to be ready on Monday, I'm going to hold off on this and see whether it would be doable with what he's building.
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
Ancient bug -- isn't worth doing at this point given the txp team's reduced usage of telemetry
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: Datasets: General → General