Closed
Bug 1414281
Opened 8 years ago
Closed 8 years ago
Sanitize client_count activity date
Categories
(Data Platform and Tools Graveyard :: Datasets: Client Count, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: frank, Assigned: relud)
Currently the activity date field can contain all kinds of invalid data. Instead, we should sanitize it the same way core_client_count does [0].
[0] https://github.com/mozilla/telemetry-airflow/blob/master/jobs/core_client_count_view.sh#L21
Updated • 8 years ago (Assignee)
Assignee: nobody → dthorn
Comment 1 • 8 years ago (Reporter)
This should sanitize based on the date provided by the environment, not the date of the run (to account for backfill of up to 18 months ago).
:mreid, you mentioned being against this - I don't see any downside to limiting to ~1 year on either side of the submission_date. Care to elaborate?
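The point about anchoring the range to the submission date rather than the run date can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `in_window` is hypothetical, and "~1 year" is taken to mean 365 days. It is not the job's actual implementation.

```python
from datetime import date, timedelta

# Assumption: "~1 year on either side" is taken as 365 days.
WINDOW = timedelta(days=365)


def in_window(activity_date: date, submission_date: date,
              window: timedelta = WINDOW) -> bool:
    """Return True if activity_date lies within +/- window of submission_date.

    The range is anchored to the submission date being processed,
    not to date.today(), so a backfill run for an old submission date
    still accepts activity dates from around that time.
    """
    return submission_date - window <= activity_date <= submission_date + window


# A backfill run processing a submission date well in the past.
submission = date(2016, 6, 1)
print(in_window(date(2016, 5, 30), submission))  # close to the submission date
print(in_window(date(2017, 11, 1), submission))  # far outside the window
```

Anchoring to `date.today()` instead would wrongly reject plausible activity dates during a backfill of data from many months ago.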
Flags: needinfo?(mreid)
Comment 2 • 8 years ago
I think limiting to a given timeframe is fine, I just mildly object to using regexes to parse dates.
Something like
try(date_parse(activity_date, '%Y-%m-%d')) AS activity_date -- Presto
or
TO_DATE(CAST(UNIX_TIMESTAMP(activity_date, 'yyyy-MM-dd') AS TIMESTAMP)) AS activity_date -- Spark SQL
would mean we also skip "2017-00-99" for example, and we could then use an actual date range in the "--where" clause.
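For readers less familiar with the SQL functions above, the same idea (parse with a real date parser and treat failures as missing, rather than matching a regex) can be sketched in Python. The function name `parse_activity_date` is hypothetical and used only for illustration.

```python
from datetime import date, datetime
from typing import Optional


def parse_activity_date(raw: Optional[str]) -> Optional[date]:
    """Parse a 'YYYY-MM-DD' string, returning None when it is not a real date.

    Roughly analogous to wrapping the parse in try(...) in Presto:
    a value such as "2017-00-99" has a plausible shape but is rejected
    because month 0 and day 99 do not exist.
    """
    if raw is None:
        return None
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return None


print(parse_activity_date("2017-11-03"))
print(parse_activity_date("2017-00-99"))
```

A shape-only regex such as `^\d{4}-\d{2}-\d{2}$` would accept "2017-00-99", which is the case the date parser catches.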
Flags: needinfo?(mreid)
Comment 3 • 8 years ago (Reporter)
(In reply to Mark Reid [:mreid] from comment #2)
> I think limiting to a given timeframe is fine, I just mildly object to using
> regexes to parse dates.
>
> Something like
>
> try(date_parse(activity_date, '%Y-%m-%d')) AS activity_date -- Presto
> or
> TO_DATE(CAST(UNIX_TIMESTAMP(activity_date, 'yyyy-MM-dd') AS TIMESTAMP)) AS
> activity_date -- Spark SQL
>
> would mean we also skip "2017-00-99" for example, and we could then use an
> actual date range in the "--where" clause.
I do agree that we should use direct date types, if possible - and it does seem to be possible here. However, we should use an IF statement there, and make them NULL if they fall outside of our range. Limiting in the `--where` clause would cause those records to not be aggregated, which doesn't seem like what we want.
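The difference between filtering and nulling can be made concrete with a small Python sketch. It assumes a 365-day window and counts rows rather than distinct clients; the names `sanitize` and `count_by_activity_date` are hypothetical and the sample values are made up for illustration.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Iterable, Optional


def sanitize(raw: Optional[str], submission_date: date,
             window: timedelta = timedelta(days=365)) -> Optional[date]:
    """Return the parsed date, or None if it is invalid or out of range."""
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None  # unparseable or missing value
    if abs(parsed - submission_date) > window:
        return None  # parseable, but implausibly far from the submission date
    return parsed


def count_by_activity_date(raws: Iterable[Optional[str]],
                           submission_date: date) -> Counter:
    """Aggregate every record; bad dates are grouped under the None key."""
    return Counter(sanitize(raw, submission_date) for raw in raws)


rows = ["2017-11-01", "2017-11-01", "2017-00-99", "1970-01-01"]
counts = count_by_activity_date(rows, date(2017, 11, 3))
print(counts[date(2017, 11, 1)], counts[None], sum(counts.values()))
```

With a filter in the where clause, the two bad rows would vanish from the totals. With the NULL approach they remain counted, just under a missing activity date.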
Comment 4 • 8 years ago
(In reply to Frank Bertsch [:frank] from comment #3)
> (In reply to Mark Reid [:mreid] from comment #2)
> > would mean we also skip "2017-00-99" for example, and we could then use an
> > actual date range in the "--where" clause.
>
> I do agree that we should use direct date types, if possible - and it does
> seem to be possible here. However, we should use an IF statement there, and
> make them NULL if they fall outside of our range. Limiting in the `--where`
> clause would cause those records to not be aggregated, which doesn't seem
> like what we want.
Do you mean including these records, counted against a "NULL" date?
Comment 5 • 8 years ago (Assignee)
We are going to remove activity_date from this dataset and only allow submission_date; see bug 1424411.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Updated • 6 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard