Closed Bug 1456264 Opened 7 years ago Closed 5 years ago

Add a meta-field with cleaned time values to main summary

Categories

(Data Platform and Tools :: General, enhancement, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: amiyaguchi, Unassigned)

References

Details

The Firefox telemetry data has an inconsistent view of time because the client and the server each see time from their own point of view (POV). Large discrepancies are expected in a small fraction of our client population because of odd use cases like multi-year hibernations or dead CMOS batteries. We can correct the most egregious values by cross-referencing third-party data sources, or by using robust statistical methods to sort them into a separate null class.

There are three parts to this bug:

1. Create a new meta field representing all date and time fields in the main ping as a native parquet date or timestamp field. See [1] for a list of fields, their examples, and ways to parse. See [2] for an old bug dug up from the graveyard. See [3] for some potential pitfalls.

2. Identify subpopulations that are at risk of a warped view of time. This can be done through domain knowledge or statistical data mining, but any corrections applied because of this identification should be annotated in the resulting dataset. This helps deal with potential bias in the future. Statistical significance tests help validate assumptions and corrections for stronger production confidence.

3. Using the new meta field, correct time in a derived dataset. The original dataset can be left as-is and all existing queries can be modified to use the new native date fields. A derived dataset can reprocess the time fields and manually correct them with exogenous data sources (telemetry_new_profile_parquet, first_shutdown_summary, etc). Additionally, it's possible to automate the cleaning by sorting values into a separate null or outlier class with some form of quantitative data cleaning. These derived datasets can be made cheaply and modified quickly as new understandings of the underlying data come to light.

Adding a separate date/timestamp field in the summary view will make it easier to work with time as a user and developer. There won't be a need to use UDFs like `from_unixtimestamp` or `date_parse`. It will be a consistent interface into client time. Finally, it is an easy target for derived datasets (a materialized view of a join between a summary dataset and a client correction table, for performance reasons).

[1] https://docs.google.com/spreadsheets/d/1DQk0YQx2PLaY2ZMTdhF2TA-ueZ6g5PWwUqbPWL90AHM/edit#gid=0
[2] Bug 1311487 - main_summary dataset should validate/clean dates
[3] Bug 1425055 - subsession_start_date not compatible with presto's `from_iso8601_timestamp`
[4] Bug 1449739 - Investigate possible bug in Profile Creation Date calculation
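
As a rough illustration of part 1, the sketch below converts raw time values into timezone-aware UTC timestamps and maps anything unparseable to None instead of raising. The input formats are assumptions made for this example (a days-since-epoch count, an ISO 8601 string with fractional seconds and a UTC offset, and nanoseconds since the epoch); the spreadsheet in [1] remains the reference for the actual fields. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumed layout: ISO 8601 with fractional seconds and a UTC offset,
# e.g. "2018-04-23T00:00:00.0+02:00". Hypothetical; check [1] for real formats.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_days_since_epoch(days) -> Optional[datetime]:
    """Convert a days-since-epoch count (assumed format) to a UTC timestamp."""
    try:
        return EPOCH + timedelta(days=int(days))
    except (TypeError, ValueError, OverflowError):
        return None  # ill-formatted or out-of-range value becomes null


def parse_iso8601(value) -> Optional[datetime]:
    """Parse an ISO 8601 string (assumed format) and normalize it to UTC."""
    try:
        return datetime.strptime(value, ISO_FORMAT).astimezone(timezone.utc)
    except (TypeError, ValueError, OverflowError):
        return None


def parse_epoch_nanos(nanos) -> Optional[datetime]:
    """Convert nanoseconds since the epoch (assumed format) to a UTC timestamp."""
    try:
        # datetime only has microsecond resolution, so drop the sub-microsecond part.
        return EPOCH + timedelta(microseconds=int(nanos) // 1000)
    except (TypeError, ValueError, OverflowError):
        return None
```

Returning None for bad input keeps the raw column untouched while giving downstream queries a native timestamp that is either valid or null.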
This is a notebook with the following values that have been transformed into timestamps. [1][2]

profile_creation: timestamp value of profile creation reported by the client clock
creation: timestamp value of ping creation as seen by the client clock
subsession_start: timestamp value of subsession start as seen by the client clock
client_submission: timestamp value of ping submission as seen by the client clock
submission: timestamp value of ping submission as seen by the ingestion clock

There are likely ill-formatted values that need to be dealt with in the future.

[1] https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/12117
[2] https://gist.github.com/acmiyaguchi/dcee9255fcea7c7caeaa947f6e1d737a
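
With client_submission and submission both available as timestamps, one possible way to build the null class of values mentioned in the description is to measure the client/server skew per ping and flag extreme skews with a robust statistic. The sketch below uses the modified z-score based on the median absolute deviation (MAD), with the conventional 0.6745 constant and 3.5 cutoff. This is one technique swapped in for illustration, not the method used in the notebook; the function names and the output keys client_submission_clean and skew_outlier are hypothetical.

```python
from statistics import median


def clock_skew_seconds(client_submission, submission):
    """Seconds by which the client clock leads (+) or lags (-) the ingestion clock."""
    return (client_submission - submission).total_seconds()


def flag_outliers(skews, threshold=3.5):
    """Return one boolean per skew; True marks a modified z-score above threshold."""
    med = median(skews)
    mad = median(abs(s - med) for s in skews)
    if mad == 0:
        # No spread estimate is available, so flag nothing rather than everything.
        return [False] * len(skews)
    return [abs(0.6745 * (s - med) / mad) > threshold for s in skews]


def clean_pings(pings):
    """Annotate each ping dict; flagged client times are nulled, raw fields are kept."""
    skews = [clock_skew_seconds(p["client_submission"], p["submission"]) for p in pings]
    cleaned = []
    for ping, is_outlier in zip(pings, flag_outliers(skews)):
        row = dict(ping)
        row["skew_outlier"] = is_outlier  # annotation of the applied correction
        row["client_submission_clean"] = None if is_outlier else ping["client_submission"]
        cleaned.append(row)
    return cleaned
```

Keeping the original fields alongside the flag means the correction stays visible in the derived dataset, which is the annotation that part 2 of the description asks for.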
Assignee: nobody → jklukas
Status: NEW → ASSIGNED
Priority: P3 → P1
Priority: P1 → P2
Assignee: jklukas → nobody
Status: ASSIGNED → NEW
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Datasets: Main Summary → General