Closed Bug 1489614 Opened 7 years ago Closed 7 years ago

total_hours_sum in clients_daily is invalid for ~18% of records

Categories

(Data Platform and Tools :: General, enhancement, P2)

enhancement
Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgaunt, Unassigned)

Details

(Whiteboard: [DataPlatform])

Attachments

(1 file)

A large number of records in CLIENTS_DAILY have total_hours_sum NOT BETWEEN 0 AND 24. I'm aware there's no input validation for simple measures but that's a larger quantity than I would expect due to random noise. A number of different count queries: https://sql.telemetry.mozilla.org/queries/58806/source I discovered this in the context of an analysis for a pref-flip study where 25% of rows were affected but not with any extra days per client and not any one client showing all of the errors. That is, even after filtering out the affected rows the overall sample size for the study was the same. https://sql.telemetry.mozilla.org/queries/58785/source#152375 Studying the docs I understood the sum of total_time from main_summary grouped by client and date was divided by 3600 before becoming total_hours_sum in clients_daily. With that in mind I pulled all data from both tables for one client with a negative sum overall. https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/31985 Wouldn't total_time of 52 would be .014 after /3600 rather than -47.849167?
I think we should remove this field from clients_daily. Per the documentation at [1], the "totalTime" field is the entire session length, but it is reported for each *sub*session. As such, its sum is a meaningless value. Instead, one should use the "subsession_hours_sum" field, which does what is expected here. There are still some unusual values for subsession_hours_sum, but the fraction of affected records is significantly less. I did some quick checks at [2] and found the rate of rows exceeding 24hrs to be just under 8%, of which about half of those make sense if you divide by the "pings_aggregated_by_this_row" field. There are some cases where the subsession length is expected to be longer than 24hrs, such as if a computer is put to sleep, but I'm not sure if ~4% is expected or not. Regarding total_time being negative, the entry in clients_daily is aggregating several rows in main_summary[3], some of which have negative values for total_time. There don't seem to be any recent negative values for subsession_length, another reason to prefer that field. [1] https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html#totaltime [2] https://sql.telemetry.mozilla.org/queries/58827/source [3] https://sql.telemetry.mozilla.org/queries/58831/source
Points: --- → 1
Priority: -- → P2
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: