revisit os and os_version across datasets
Categories
(Data Platform and Tools :: General, task, P3)
Tracking
(Not tracked)
People
(Reporter: mconnor, Unassigned)
References
Details
(Whiteboard: [dataplatform])
With the release of Windows 11, our current model for handling OS versions is not working well. The actual internal version for Windows is >= 10.0.21996, which is then truncated to 10.0. The only way to distinguish, if available, is by using windows_build_number.
As a short term workaround, we could update the ETL job to munge those versions to 11.0, but there's a larger question.
Looking at these fields, and also bug 1659455, this should likely be a trigger for revisiting how we handle os / os_version. The current approach is built on OS kernel versions, which are relatively opaque, requiring dataset consumers to translate these versions to the user-facing versions. There's also significant inconsistency in the level of detail in these fields, with only 9 "Windows_NT" versions, 81 Mac versions, and 2500+ Linux versions.
If there are concerns about breaking existing queries (I'm not sure who all of the consumers are), then it may make sense to start by creating "normalized" fields. Ben Wu took at stab at this a while back, so we have a starting point. I'm not sure enough people use these that we shouldn't just upgrade os_version to be useful...
What I'd propose is something like:
normalized_os_name
- Windows, Linux, macOS, etc
normalized_os_version
- WIndows: 95, 98, 2000, 2003 Server, XP, Vista, 7, 8, 8.1, 10, 11 using this table as a starting point.
- Linux: major+minor kernel version
- Mac: 10.x, 11, 12 (using this table as a guide)
normalized_os_patch_level
- Windows: use windows_build_number here
- Linux: use the patch level (third part of the kernel string)
- Mac: use the Darwin kernel version
In practice, I'd expect most queries would use the first one or two, with the third as a cross-platform replacement for windows_build_number.
Key use cases:
- Easy creation of human-readable tables and graphs breaking down queries by os/version (no need to write case statements to munge data, or manually labelling graphs with the user-facing versions)
- More intuitive, human-readable authoring of queries
** Today: WHERE os = 'Darwin' and os_version LIKE '19%'
** Future: WHERE os = 'macOS' and os_version = '10.15'
Comment 1•4 years ago
|
||
normalized_os
The top-level normalized_os field of pings is populated via NormalizeAttributes in the ingestion pipeline and should produce one of "Windows", "Linux", or "Mac".
I think of this top-level normalization as a solved problem, and I think we should stick with the values that currently exist. But there is likely a gap in terms of documentation and users knowing about what to expect in this field vs. nested fields within the ping that contain more raw values.
:mconnor - Is normalized_os as currently constructed useful for the role of normalized_os_name you describe?
normalized_os_version
At the looker level, we now have a static namespace where we've implemented country lookup that can be easily mixed into explores in order to translated from country codes to logical country names, etc. I would hope we can do similar here, although the mapping is not quite as straight-forward.
There's a problem here of process - if we build the relevant tables to translate from kernel versions to logical OS versions, how do we make sure we keep that up to date? This is the biggest chunk of work here.
normalized_os_patch_level
As long as I'm understanding correctly that this is essentially a fallback to the raw value provided in the ping, this sounds reasonable.
Comment 2•4 years ago
|
||
This may be a problem best considered under the new analytics engineering group under :gkabbz.
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Description
•