Closed Bug 1258827 Opened 10 years ago Closed 8 years ago

Define shared heuristics for mapping temporal fields in UT pings to temporal analysis buckets

Categories

(Data Platform and Tools :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: spenrose, Unassigned)

Details

UT pings contain several fields recording the date and time of events, which are:
- generally based on untrusted client clocks
- quite diverse in representation (days since epoch, ISO format, etc.)

Mozilla buckets UT pings by two temporal categories:
- "session": a single period of use
- calendar time (day, week, month, year)

Before becoming available for analysis, each ping has been temporally bucketed by a pipeline process that is a black box for the analyst. Each analysis must then make its own decision about how to further or re-bucket the pings by session, week, or whatever bucket is of interest for that analysis.

I propose that:
1) For each kind of bucketing (i.e. session and day), there should be one defined heuristic, so that everyone does it the same way.
2) Said heuristics should be used to build pre-analysis datasets, so that as few people as possible have to do it at all.
3) The client team should look into reducing the diversity of representations of time in the UT ping.
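To make the "one defined heuristic" idea concrete, here is a minimal sketch of what a shared helper for calendar bucketing might look like. It is not the pipeline's actual behavior. The function names (`to_calendar_day`, `week_bucket`) are hypothetical, and it assumes that integer fields are days since the Unix epoch, that string fields are ISO 8601 values starting with `YYYY-MM-DD`, and that the client-reported date is used without any correction for clock error.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_calendar_day(value):
    """Map a temporal field to a calendar day (hypothetical helper).

    Assumes ints are days since the Unix epoch and strings are ISO 8601
    values whose first ten characters are YYYY-MM-DD.
    """
    if isinstance(value, int):
        return EPOCH + timedelta(days=value)
    if isinstance(value, str):
        # Keep only the date portion; time of day and UTC offset are ignored.
        return date.fromisoformat(value[:10])
    raise TypeError(f"unsupported temporal field: {value!r}")


def week_bucket(day):
    """Return the Monday that starts the week containing `day`."""
    return day - timedelta(days=day.weekday())
```

With a helper like this, a days-since-epoch field and an ISO-format field that describe the same day land in the same day and week bucket, which is the consistency the proposal asks for.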
I wonder if there is anything we could do to make date measurement more reliable on the client. For example, instead of querying the OS-level date, could we ping a date server and use that along with our monotonic clock to set telemetry dates? Bad dates (and especially inconsistent dates) make a lot of things a real headache, and the situation is notably more difficult in v4, which has a lot more dates bouncing around than v2 did.

FYI, the following doc has a few ECDFs of "date skew" per ping, defined as the difference between the date portions of "meta/Timestamp" and "payload/info/subsessionStartDate" (which is not a perfect proxy, since it's conflated with ping lag). This is rolled up into the max and min date skew per client, and then the difference between the two (max minus min): https://docs.google.com/a/mozilla.com/document/d/1GeFoGc1QU3xvN5z8seBIH4K2eNOr-nxpLwym0ov5tLA/edit?usp=sharing

As we have known for a long time, there are enough clients with enough anomalous conditions to make discarding outlier dates problematic and difficult. It's also interesting how much variability there is between the max and min values within each client. Most of this is probably lag, but it still makes it hard to operate on dates and know how to correct or normalize them. Trusted dates would be a big help!
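The date skew roll-up described above could be sketched roughly as follows. This is an illustration under assumptions, not the code behind the linked doc: it assumes "meta/Timestamp" is a server-side timestamp in nanoseconds since the Unix epoch, that "payload/info/subsessionStartDate" is an ISO 8601 string starting with `YYYY-MM-DD`, and that skew is measured in whole days. The function names and the input tuple layout are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def date_skew_days(submission_ts_ns, subsession_start_iso):
    """Days between the server-side date and the client-reported date.

    Assumes submission_ts_ns is nanoseconds since the Unix epoch (UTC) and
    subsession_start_iso begins with YYYY-MM-DD. Positive values mean the
    client date is earlier than the server date (ping lag or a slow clock).
    """
    seconds = submission_ts_ns // 1_000_000_000
    server_day = datetime.fromtimestamp(seconds, tz=timezone.utc).date()
    client_day = date.fromisoformat(subsession_start_iso[:10])
    return (server_day - client_day).days


def skew_range_per_client(pings):
    """Roll up per-ping skew into (min, max, max - min) per client.

    `pings` is an iterable of (client_id, submission_ts_ns,
    subsession_start_iso) tuples; this layout is an assumption.
    """
    skews = defaultdict(list)
    for client_id, ts_ns, start_iso in pings:
        skews[client_id].append(date_skew_days(ts_ns, start_iso))
    return {
        cid: (min(vals), max(vals), max(vals) - min(vals))
        for cid, vals in skews.items()
    }
```

A large max-minus-min value for a client suggests that its dates cannot be fixed with a single constant offset, which is the normalization difficulty noted above.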
Priority: -- → P3
Component: Metrics: Pipeline → General
Product: Cloud Services → Data Platform and Tools
I don't think this bug is particularly actionable, so I'm going to close it. There are a couple of related developments to look at for potential solutions:
- The Clients Daily dataset[1], which gives a per-calendar-day view by client_id
- The Client Count Daily dataset[2], which includes aggregate counts (using HLL) by calendar day *and* submission day.

[1] https://docs.telemetry.mozilla.org/concepts/choosing_a_dataset.html
[2] https://github.com/mozilla/telemetry-airflow/pull/196
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX