Closed
Bug 1404204
Opened 8 years ago
Closed 5 years ago
Main Summary should not include NULL client ids
Categories
(Data Platform and Tools :: General, enhancement, P1)
Data Platform and Tools
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: frank, Assigned: relud)
Details
Attachments
(1 file)
We should be able to trust that sample_id is within some bounds; e.g. is a Long. Some partitions [0] have it as NULL, which makes using sample_id that much more difficult.
[0] s3://telemetry-parquet/main_summary/v4/submission_date_s3=20170707/sample_id=__HIVE_DEFAULT_PARTITION__
| Reporter | ||
Comment 1•8 years ago
|
||
Note: __HIVE_DEFAULT_PARTITION__ comes from https://issues.apache.org/jira/browse/SPARK-19887
| Reporter | ||
Comment 2•8 years ago
|
||
OTOH: Should we have NULL client ids? Seems like the schema should send there to the error stream. If we send any client-ids that are not UUIDs[0], then missing ones should be invalid as well.
[0] https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/main/main.4.schema.json#L51
Comment 3•8 years ago
|
||
-1 for setting a default sample_id... if client_id is null, then sample_id should also be null.
Alessio, are there any cases where we still expect client_id to be missing for main pings? Do you remember when the "null client_id during early initialization" thing got fixed?
Flags: needinfo?(alessio.placitelli)
Comment 4•8 years ago
|
||
(In reply to Mark Reid [:mreid] from comment #3)
> -1 for setting a default sample_id... if client_id is null, then sample_id
> should also be null.
>
> Alessio, are there any cases where we still expect client_id to be missing
> for main pings? Do you remember when the "null client_id during early
> initialization" thing got fixed?
No, we don't expect it to be missing from "main-ping"s. The fix for this landed in bug 1233986, Firefox 50 on the 28th of June 2016.
Flags: needinfo?(alessio.placitelli)
Comment 5•8 years ago
|
||
Ok, let's do an analysis to see how common this is post-Firefox-50. I'm inclined to WONTFIX this bug and have analyses limit to data with non-null sample ids where needed.
Comment 6•8 years ago
|
||
SELECT count(*)
FROM main_summary
WHERE submission_date_s3 >= '20170801'
AND sample_id IS NULL
> 23
There are probably more in the raw data (where application != Firefox, specifically), so I'm not sure it's a good idea to reject them at validation time, but we could exclude them from main_summary with little-to-no impact.
Frank, what do you think about skipping these records when generating main_summary?
Flags: needinfo?(fbertsch)
| Reporter | ||
Comment 7•8 years ago
|
||
Sounds good to me.
Component: Pipeline Ingestion → Datasets: Main Summary
Flags: needinfo?(fbertsch)
Priority: -- → P3
Summary: Ingestion should set default sample_id for NULL client ids → Main Summary should not include NULL client ids
Comment 8•5 years ago
|
||
| Assignee | ||
Comment 9•5 years ago
|
||
linked PR should fix this by requiring main pings to have clientId in the decoder
Assignee: nobody → dthorn
Status: NEW → ASSIGNED
Points: --- → 1
Priority: P3 → P1
| Assignee | ||
Updated•5 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Updated•3 years ago
|
Component: Datasets: Main Summary → General
You need to log in
before you can comment on or make changes to this bug.
Description
•