main_summary dataset should validate/clean dates

RESOLVED INCOMPLETE

Status

Cloud Services
Metrics: Pipeline
P3
normal
RESOLVED INCOMPLETE
2 years ago
3 months ago

People

(Reporter: ddurst, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
Created attachment 8802633 [details]
5 csv files of the different permutations

Chutten and I came across this oddity when looking at s.t.m.o for some forensic analysis on orphan updating: your results may be different using different kinds of dashes (probably in error) as part of the date "format" (%Y-%m-%d).

So I ran a small test with https://sql.telemetry.mozilla.org/queries/1463/source#table.

I've attached the results of running that query with five permutations of the date format of '2016-10-20':
1. H: both dashes are hyphens
2. N1: first dash is a hyphen, second is an ndash
3. N2: both dashes are ndashes
4. M1: first dash is a hyphen, second is an mdash
5. M2: both dashes are mdashes

Diff'd the output and 1, 2, and 4 are all equal. 3 and 5 are equal. So this tells me that the character is not really the issue, but rather how that character is treated/translated.

1 and 3 are not equal: 1 contains results in 2016 and 3 only has results with year > 2016.

So what happens if someone erroneously formats a date with non-hyphens? How are hyphens being handled on the back-end?
(Reporter)

Comment 1

2 years ago
robotblake made a good point in IRC via regex experiment: most likely, the activity_date is being compared as strings (duh), so xxxx— vs xxxx&hyphen diffs are probably due to just ASCII values.

But this begs a different question, then: if we need to query a specific date, what's the best (read: safest) non-ASCII way to do it?

Use a regex to remove non-numerics? try to cast to a date type (all such efforts have failed)? format as a date string?

Comment 2

2 years ago
is this something your team can take Rob?
Flags: needinfo?(rmiller)
Yes, we can do this, I've added an issue to our tracker: https://github.com/mozilla/redash/issues/11
Flags: needinfo?(rmiller)
robotblake is correct, the 'activity_date' column in this table is a string, not a date, and '>=' therefore does string comparison.

Here is a version of the query that parses the date column and compares it to a date value (after first using a regex to ignore rows with invalid date syntax):

SELECT *
FROM client_count
WHERE
	normalized_channel = 'Other'
	AND regexp_like(activity_date, '\d{4}-\d{2}-\d{2}')
	AND from_iso8601_date(activity_date) >= date '2016-10-20'
	AND country = 'CN'
	AND os_version = '10.0'
	AND app_version = '45.3.0'
ORDER BY activity_date ASC
(Reporter)

Comment 5

2 years ago
(O hai, ashort!)

The only problem with that is the "first using a regex to ignore rows with invalid date syntax" -- if those rows have invalid date syntax, are we guaranteed that they have invalid date *data* (i.e. we are safe to exclude them)?

So, basically, your comment confirms the issue. Aside from the question of excluding rows based on syntax, is this a preferred way of working around this issue, or is some other solution going to be implemented that obviates it?

I understand the issue isn't *with* redash, but I'd expect either a function to safely cast to date types (which would force consistent handling of any of - – —) or something implicit that would do the same (without any regex conversions).
Flags: needinfo?(rmiller)
Flags: needinfo?(ashort)
The best thing to do would clearly be to fix the data source. Putting the burden of safe conversion and comparison on the querying tool is always going to be brittle; there's no way we're going to be able to account for all permutations of what might end up in the database. We can try to provide something as a band-aid, but is there a bug open anywhere for someone to turn this field into an actual date field so we fix the root of the issue?
Flags: needinfo?(rmiller)
(In reply to David Durst [:ddurst] from comment #5) 
> The only problem with that is the "first using a regex to ignore rows with
> invalid date syntax" -- if those rows have invalid date syntax, are we
> guaranteed that they have invalid date *data* (i.e. we are safe to exclude
> them)?

The error I got when leaving out this regex was 'Cannot parse "22951-04-0"' so there are clearly some entries that just aren't recognizable as dates at all.
Flags: needinfo?(ashort)
Summary: Dashes in dates in s.t.m.o are not treated equally? → main_summary dataset should validate/clean dates

Comment 8

2 years ago
We should convert the data in the main summary dataset to use actual parquet "Date" types. Then we can treat them in a sane way, with any assumptions documented accordingly (like using null for invalid dates).
(In reply to Rob Miller [:RaFromBRC :rmiller] from comment #6)
> The best thing to do would clearly be to fix the data source. Putting the
> burden of safe conversion and comparison on the querying tool is always
> going to be brittle; there's no way we're going to be able to account for
> all permutations of what might end up in the database. We can try to provide
> something as a band-aid, but is there a bug open anywhere for someone to
> turn this field into an actual date field so we fix the root of the issue?

We tried using a proper date type in Parquet in the past but the version of Presto we were using at the time had some issues with it. We should verify that our current version of Presto still has this issue.

Updated

2 years ago
Points: --- → 3
Priority: -- → P3
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → INCOMPLETE
See Also: → bug 1456264
You need to log in before you can comment on or make changes to this bug.