Windows 10 Client Distribution Dashboard needs to be rewritten for BigQuery
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: harter, Assigned: wlach)
References
Details
Comment 9•6 years ago (Assignee)
Harter, is this bug still relevant?
Comment 10•6 years ago
My understanding is that the dashboard's future is up in the air at the moment (though FWIW, it's still one I use from time to time, including recently while tracking patch uptake for bug 1585109).
Comment 11•6 years ago (Reporter)
The dashboard will be broken with the deprecation of longitudinal. RyanVM is still interested in this dashboard. We can keep this bug to cover the work needed to rebuild the dashboard or we can file a new bug. FWIW, it sounds like we don't have resources to handle the rebuild on the DS team.
Comment 12•6 years ago (Assignee)
OK. It turns out this dashboard is generated from two Hive metastore views against longitudinal (thanks :jason for providing the references):
CREATE VIEW hive.default.windows_10_aggregate AS
WITH filtered AS (
    SELECT *
    FROM longitudinal
    WHERE system_os[1].name = 'Windows_NT'
      AND system_os[1].version LIKE '10%'
      AND TRY_CAST("split_part"(build[1].version, '.', 1) AS bigint) >= 47
      AND "date_diff"('day', "date_parse"(submission_date[1], '%Y-%m-%dT00:00:00Z'), current_date) < 28
), aggregated AS (
    SELECT
        system_os[1].name AS windows_name,
        system_os[1].version AS windows_version,
        system_os[1].windows_build_number AS build_number,
        TRY_CAST(system_os[1].windows_ubr AS varchar) AS windows_ubr,
        IF(system_os[1].windows_build_number <= 10240, '1507',
          IF(system_os[1].windows_build_number <= 10586, '1511',
            IF(system_os[1].windows_build_number <= 14393, '1607',
              IF(system_os[1].windows_build_number <= 15063, '1703',
                IF(system_os[1].windows_build_number <= 16299, '1709',
                  IF(system_os[1].windows_build_number <= 17134, '1803',
                    IF(system_os[1].windows_build_number <= 17763, '1809',
                      IF(system_os[1].windows_build_number <= 18362, '1903',
                        IF(system_os[1].windows_build_number > 18362, 'Insider', null))))))))) AS build_group,
        TRY_CAST("split_part"(build[1].version, '.', 1) AS bigint) AS ff_build_version,
        normalized_channel,
        "count"(1) AS count
    FROM filtered
    GROUP BY 1, 2, 3, 4, 5, 6, 7
    ORDER BY count DESC
)
SELECT *
FROM (aggregated
    CROSS JOIN (SELECT "sum"(count) AS total_obs FROM aggregated) total (total_obs))
CREATE VIEW hive.default.build_distribution AS
WITH channel_summary AS (
    SELECT
        normalized_channel,
        build_group,
        "max"(build_number) AS oo,
        "sum"(count) AS frequency
    FROM windows_10_aggregate
    GROUP BY 1, 2
    ORDER BY 2 ASC
), counts AS (
    SELECT normalized_channel, "sum"(frequency) AS total
    FROM channel_summary
    GROUP BY 1
)
SELECT
    cs.normalized_channel,
    build_group,
    CAST(frequency AS double) / total AS frequency,
    frequency AS count
FROM (channel_summary cs
    INNER JOIN counts cc ON cs.normalized_channel = cc.normalized_channel)
ORDER BY cs.oo ASC
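The second view is just a per-channel normalization. As a minimal pure-Python sketch of the same computation (the `(normalized_channel, build_group, count)` input tuples and the `build_distribution` function name are hypothetical, not part of the actual job):

```python
from collections import defaultdict

def build_distribution(rows):
    """Sketch of the build_distribution view: per-channel relative
    frequency of each Windows 10 build group. `rows` is a hypothetical
    list of (normalized_channel, build_group, count) tuples."""
    totals = defaultdict(int)   # total observations per channel
    counts = defaultdict(int)   # observations per (channel, group)
    for channel, group, n in rows:
        totals[channel] += n
        counts[(channel, group)] += n
    # frequency = count / channel total, as in the SQL above
    return {key: counts[key] / totals[key[0]] for key in counts}
```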
We should be able to convert these into bigquery-etl jobs scheduled with airflow easily enough: we can replace the use of longitudinal with a sample of either main_summary or telemetry.main.
I don't feel like we should change anything else about this dashboard at this time: certainly not moving any logic to python_mozetl, which is going away. :)
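The trickiest part of the conversion is the nested IF() chain that buckets Windows build numbers into release names. For reference, the same mapping as an illustrative Python sketch (thresholds copied from the view above; the helper itself is hypothetical):

```python
# Release thresholds copied from the nested IF() chain in the
# windows_10_aggregate view: a client whose build number is <= the
# threshold belongs to that Windows 10 release.
BUILD_GROUPS = [
    (10240, "1507"),
    (10586, "1511"),
    (14393, "1607"),
    (15063, "1703"),
    (16299, "1709"),
    (17134, "1803"),
    (17763, "1809"),
    (18362, "1903"),
]

def build_group(build_number):
    """Map a Windows 10 build number to its release name, or
    'Insider' for anything newer than the last known release."""
    if build_number is None:
        return None
    for threshold, name in BUILD_GROUPS:
        if build_number <= threshold:
            return name
    return "Insider"
```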
Comment 14•6 years ago (Assignee)
(In reply to Mark Reid [:mreid] from comment #8)
> If we do end up scheduling tasks via Airflow that re-run the query and
> replace all the data, we should be sure to set the task's "catchup"
> parameter to false. That seems to effectively mean that the latest run is
> expected to produce all the output that's needed (rather than one
> incremental task for each day/period/run).
Let's just use the bigquery-etl model of creating per-day views. We can always aggregate inside the dashboard itself. It should be relatively fast.
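To make the `catchup` semantics quoted above concrete, here is an illustrative stdlib-only sketch of the scheduler behavior — not actual Airflow code, and `scheduled_runs` is a made-up name:

```python
from datetime import date, timedelta

def scheduled_runs(start, today, catchup):
    """Illustrate Airflow's `catchup` DAG parameter: with catchup=True
    the scheduler backfills one run per missed daily interval; with
    catchup=False only the latest interval runs, so that single run
    must produce all of the needed output."""
    intervals = [start + timedelta(days=i) for i in range((today - start).days)]
    return intervals if catchup else intervals[-1:]
```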
Comment 15•6 years ago (Assignee)
Update: Have a draft PR which covers one of the two views that build up this dashboard: https://github.com/mozilla/bigquery-etl/pull/441
It's currently kinda broken but is a start. Another project (porting the mission control v2 dashboard to docker / airflow) is taking most of my cycles, but I intend to return to this soon.
Comment 16•6 years ago (Assignee)
Wound up converting the hive views to bigquery views against client_daily (see above PR).
New dashboard is here: https://sql.telemetry.mozilla.org/dashboard/windows-10-client-distributions
The old dashboard now redirects to the new one. All of the old queries have been unscheduled (the new ones update every 2 days by default).
I only ported one of the patch queries in the above, but it's trivial to add more as required. If there are any issues, please file new bugs!