Closed Bug 1414281 Opened 8 years ago Closed 8 years ago

Sanitize client_count activity date

Categories

(Data Platform and Tools Graveyard :: Datasets: Client Count, enhancement, P3)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: frank, Assigned: relud)

References

Details

Currently there can be all kinds of weird data in the activity date; instead, we should sanitize it the way core_client_count does [0].
[0] https://github.com/mozilla/telemetry-airflow/blob/master/jobs/core_client_count_view.sh#L21
Assignee: nobody → dthorn
Blocks: 1415935
This should sanitize based on the date provided by the environment, not the date of the run (to account for backfill of up to 18 months ago). :mreid, you mentioned being against this - I don't see any downside to limiting to ~1 year on either side of the submission_date. Care to elaborate?
Flags: needinfo?(mreid)
I think limiting to a given timeframe is fine, I just mildly object to using regexes to parse dates.

Something like

try(date_parse(activity_date, '%Y-%m-%d')) AS activity_date -- Presto
or
TO_DATE(CAST(UNIX_TIMESTAMP(activity_date, 'yyyy-MM-dd') AS TIMESTAMP)) AS activity_date -- Spark SQL

would mean we also skip "2017-00-99" for example, and we could then use an actual date range in the "--where" clause.
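To make the objection concrete, here is a minimal Python sketch (not the job's actual SQL) contrasting a shape-only regex with real date parsing. The regex and the function names `looks_like_date` and `parse_activity_date` are hypothetical, chosen only for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical shape-only check: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def looks_like_date(raw: str) -> bool:
    """Regex approach: checks the layout of the string, not the calendar."""
    return DATE_SHAPE.match(raw) is not None


def parse_activity_date(raw: Optional[str]) -> Optional[date]:
    """Parsing approach: returns a date, or None if the value is not a real date.

    This plays the role of try(date_parse(...)) in the Presto snippet:
    a failed parse becomes a null instead of an error.
    """
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


if __name__ == "__main__":
    for value in ["2017-11-03", "2017-00-99", "garbage"]:
        print(value, looks_like_date(value), parse_activity_date(value))
```

In this sketch "2017-00-99" passes the regex but fails to parse, which is the case the comment calls out.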
Flags: needinfo?(mreid)
(In reply to Mark Reid [:mreid] from comment #2)
> I think limiting to a given timeframe is fine, I just mildly object to using
> regexes to parse dates.
>
> Something like
>
> try(date_parse(activity_date, '%Y-%m-%d')) AS activity_date -- Presto
> or
> TO_DATE(CAST(UNIX_TIMESTAMP(activity_date, 'yyyy-MM-dd') AS TIMESTAMP)) AS
> activity_date -- Spark SQL
>
> would mean we also skip "2017-00-99" for example, and we could then use an
> actual date range in the "--where" clause.

I do agree that we should use direct date types, if possible - and it does seem to be possible here. However, we should use an IF statement there, and make them NULL if they fall outside of our range. Limiting in the `--where` clause would cause those records to not be aggregated, which doesn't seem like what we want.
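The difference between nulling out-of-range dates and filtering them can be sketched in Python as follows. This is an illustration only, not the dataset's actual job: the 365-day window stands in for the "~1 year on either side" mentioned earlier in the thread, and `sanitize` and `count_by_activity_date` are hypothetical names.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Iterable, Optional

# Assumption: "~1 year on either side of the submission_date".
WINDOW = timedelta(days=365)


def sanitize(raw: Optional[str], submission_date: date) -> Optional[date]:
    """Return the parsed date, or None if it is invalid or outside the window."""
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None  # unparseable values become NULL
    if abs(parsed - submission_date) > WINDOW:
        return None  # out-of-range values become NULL rather than being dropped
    return parsed


def count_by_activity_date(raws: Iterable[Optional[str]],
                           submission_date: date) -> Counter:
    """Aggregate every record; bad dates are grouped under the None key."""
    return Counter(sanitize(raw, submission_date) for raw in raws)


if __name__ == "__main__":
    rows = ["2017-11-01", "2017-00-99", "1970-01-01", "2017-11-01"]
    counts = count_by_activity_date(rows, date(2017, 11, 3))
    # Every input row is still counted; a WHERE-style filter would lose two.
    print(counts, sum(counts.values()))
```

With this approach the total across all groups still equals the number of input records, whereas a range filter would silently shrink it.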
(In reply to Frank Bertsch [:frank] from comment #3)
> (In reply to Mark Reid [:mreid] from comment #2)
> > would mean we also skip "2017-00-99" for example, and we could then use an
> > actual date range in the "--where" clause.
>
> I do agree that we should use direct date types, if possible - and it does
> seem to be possible here. However, we should use an IF statement there, and
> make them NULL if they fall outside of our range. Limiting in the `--where`
> clause would cause those records to not be aggregated, which doesn't seem
> like what we want.

Do you mean including these records, counted against a "NULL" date?
We are going to remove activity_date from this dataset and only allow submission_date, in bug 1424411.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Product: Data Platform and Tools → Data Platform and Tools Graveyard