Closed Bug 1399532 Opened 7 years ago Closed 5 years ago

Windows 10 Client Distribution Dashboard needs to be rewritten for bigquery

Categories

(Data Platform and Tools :: General, enhancement, P2)

x86_64
Linux
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harter, Assigned: wlach)

References

Details

The Windows 10 Client Dashboard [1] is actively used by QA/Release engineering to monitor uptake of new Windows releases.

This dashboard leans heavily on virtual tables to share logic between queries, which makes it very difficult to maintain: the upstream queries that create the virtual tables cannot be collaborated on, and the SQL is not version controlled.

Instead, we should have a derived dataset generated in python_mozetl with a few simple queries feeding the dashboard.

[1] https://sql.telemetry.mozilla.org/dashboard/windows-10-user-strata
Will, do you have any interest in working on this?
Flags: needinfo?(wlachance)
Drive-by comment: I would rather have versioned queries, and the capabilities that come with that, than have to create tedious ETL jobs for things like views. If we followed through with bug 1274990 (which really is not very difficult), we could have a separate repo for scheduled queries.
What advantage does creating a new Airflow operator have over keeping this logic in python_mozetl? Airflow has proven to be difficult to test. I could see a simple Python wrapper for executing a query and saving the result to S3.

I guess the cost is booting up a new cluster for simple ETL?
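
To be concrete, the wrapper I have in mind is tiny. A rough sketch (the Presto host, bucket, and key are placeholders, and pyhive/boto3 are just one plausible choice of libraries, not a decided design):

# Hypothetical sketch: run a Presto query and write the result to S3 as CSV.
# The host, bucket, and key below are placeholders.
import csv
import io

import boto3
from pyhive import presto


def run_query_to_s3(sql, bucket, key):
    cursor = presto.connect(host="presto.example.internal", port=8080).cursor()
    cursor.execute(sql)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())


run_query_to_s3(
    "SELECT * FROM build_distribution",
    "example-output-bucket",
    "windows10/build_distribution.csv",
)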
Flags: needinfo?(fbertsch)
(In reply to Ryan Harter [:harter] from comment #3)
> What advantage does creating a new Airflow operator have over keeping this
> logic in python_mozetl? Airflow has proven to be difficult to test. I could
> see a simple python wrapper for executing a query and saving the result to
> S3. 
> 
> I guess the cost is booting up a new cluster for simple ETL?

I suppose in reality these are separate asks:
1. If we use Presto (i.e. a new Airflow operator), we don't get materialized views; the view query would need to run every time a query that uses it runs.
2. If we use an ETL job, it materializes that data, which will be faster for the queries that use it. Downsides are more testing and maintaining of YAETL (yet-another-ETL ;)).

SQL queries are meant to be declarative and simpler. I would much appreciate the ability to schedule and use them in Airflow; this bug just seemed like a good vehicle to do so :)
Flags: needinfo?(fbertsch)
(In reply to Ryan Harter [:harter] from comment #1)
> Will, do you have any interest in working on this?

Sorry for the late reply, yes I am interested. I defer to you and Frank on the best way of implementing this; the python-mozetl approach would probably be a lot easier for me to implement, but I'm happy to work on the Airflow operator as well if it seems to make more sense.
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> (In reply to Ryan Harter [:harter] from comment #1)
> > Will, do you have any interest in working on this?
> 
> Sorry for the late reply, yes I am interested. 

Awesome!

> I defer to you and Frank on
> the best way of implementing this; the python-mozetl approach would probably
> be a lot easier for me to implement, but I'm happy to work on the Airflow
> operator as well if it seems to make more sense.

I think python-mozetl is a better fit for this task since it would be better if these datasets were materialized. It currently takes ~5-10m to run a query, which makes it difficult to debug problems.

As noted in Bug 1393830, this isn't an immediate need. It has, however, been a maintenance burden, so I'd like to make progress while we can.

I'll set up a meeting next week to discuss.
Assignee: nobody → wlachance
FYI, I am going to be requesting an update to the dashboard within the next couple of weeks once the Fall Creators Update is finalized. Not saying this needs to change any immediate plans, just a heads-up.
Component: General → Datasets: General
If we do end up scheduling tasks via Airflow that re-run the query and replace all the data, we should be sure to set the DAG's "catchup" parameter to false. That seems to mean, in effect, that the latest run is expected to produce all the output that's needed (rather than one incremental task for each day/period/run).

See:
https://airflow.apache.org/scheduler.html#backfill-and-catchup
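
Concretely, it's just an argument on the DAG definition. A sketch (the dag_id, start_date, and schedule here are made up, not from our actual DAGs):

# Hypothetical sketch: catchup=False on the DAG definition.
# dag_id, start_date, and schedule_interval are placeholders.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="windows_10_dashboard",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,  # run only the latest period; don't backfill one run per day
)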

Harter, is this bug still relevant?

Assignee: wlachance → nobody
Flags: needinfo?(rharter)

My understanding is that the dashboard's future is up in the air at the moment (though FWIW, it's still one I use from time to time, including recently while tracking patch uptake for bug 1585109).

The dashboard will be broken with the deprecation of longitudinal. RyanVM is still interested in this dashboard. We can keep this bug to cover the work needed to rebuild the dashboard or we can file a new bug. FWIW, it sounds like we don't have resources to handle the rebuild on the DS team.

Flags: needinfo?(rharter)

OK. It turns out this dashboard is generated from datasets defined by two Hive metastore queries against longitudinal (thanks :jason for providing the references):

CREATE VIEW hive.default.windows_10_aggregate AS
WITH filtered AS (
   SELECT *
   FROM longitudinal
   WHERE (system_os[1].name = 'Windows_NT')
     AND (system_os[1].version LIKE '10%')
     AND (TRY_CAST("split_part"(build[1].version, '.', 1) AS bigint) >= 47)
     AND ("date_diff"('day', "date_parse"(submission_date[1], '%Y-%m-%dT00:00:00Z'), current_date) < 28)
), aggregated AS (
   SELECT
     system_os[1].name windows_name,
     system_os[1].version windows_version,
     system_os[1].windows_build_number build_number,
     TRY_CAST(system_os[1].windows_ubr AS varchar) windows_ubr,
     IF((system_os[1].windows_build_number <= 10240), '1507',
       IF((system_os[1].windows_build_number <= 10586), '1511',
         IF((system_os[1].windows_build_number <= 14393), '1607',
           IF((system_os[1].windows_build_number <= 15063), '1703',
             IF((system_os[1].windows_build_number <= 16299), '1709',
               IF((system_os[1].windows_build_number <= 17134), '1803',
                 IF((system_os[1].windows_build_number <= 17763), '1809',
                   IF((system_os[1].windows_build_number <= 18362), '1903',
                     IF((system_os[1].windows_build_number > 18362), 'Insider', null))))))))) build_group,
     TRY_CAST("split_part"(build[1].version, '.', 1) AS bigint) ff_build_version,
     normalized_channel,
     "count"(1) count
   FROM filtered
   GROUP BY 1, 2, 3, 4, 5, 6, 7
   ORDER BY count DESC
)
SELECT *
FROM (aggregated
  CROSS JOIN (
    SELECT "sum"(count) total_obs
    FROM aggregated
  ) total (total_obs))

CREATE VIEW hive.default.build_distribution AS
WITH channel_summary AS (
   SELECT
     normalized_channel,
     build_group,
     "max"(build_number) oo,
     "sum"(count) frequency
   FROM windows_10_aggregate
   GROUP BY 1, 2
   ORDER BY 2 ASC
), counts AS (
   SELECT
     normalized_channel,
     "sum"(frequency) total
   FROM channel_summary
   GROUP BY 1
)
SELECT
   cs.normalized_channel,
   build_group,
   (CAST(frequency AS double) / total) frequency,
   frequency count
FROM (channel_summary cs
  INNER JOIN counts cc ON (cs.normalized_channel = cc.normalized_channel))
ORDER BY cs.oo ASC

We should be able to convert these into bigquery-etl jobs scheduled with Airflow easily enough: we can replace the use of longitudinal with a sample of either main_summary or telemetry.main.
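
To give a flavor of the conversion, the filter/aggregate half of the first view might translate roughly like this (a sketch only; field names like normalized_os and environment.system.os.windows_build_number are from memory and need verifying against the actual schema, and sample_id = 42 just stands in for "a sample"):

# Rough sketch of the converted query, run via the BigQuery client library.
# All field names and the sampling clause are assumptions to verify.
from google.cloud import bigquery

QUERY = """
SELECT
  normalized_os_version AS windows_version,
  environment.system.os.windows_build_number AS build_number,
  normalized_channel,
  COUNT(*) AS count
FROM telemetry.main
WHERE normalized_os = 'Windows'
  AND normalized_os_version LIKE '10%'
  AND sample_id = 42
  AND DATE(submission_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY 1, 2, 3
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(dict(row))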

I don't feel like we should change anything else about this dashboard at this time: certainly not moving any logic to python_mozetl, which is going away. :)

Summary: Replace most of the Windows 10 Client Distribution Dashboard's logic → Windows 10 Client Distribution Dashboard needs to be rewritten for bigquery

I am going to attempt to do this.

Assignee: nobody → wlachance

(In reply to Mark Reid [:mreid] from comment #8)

> If we do end up scheduling tasks via Airflow that re-run the query and
> replace all the data, we should be sure to set the DAG's "catchup"
> parameter to false. That seems to mean, in effect, that the latest run is
> expected to produce all the output that's needed (rather than one
> incremental task for each day/period/run).

Let's just use the bigquery-etl model of creating per-day views. We can always aggregate inside the dashboard itself. It should be relatively fast.
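
In other words, each daily run writes rows tagged with its submission_date, and the dashboard query just sums over the trailing window. A sketch of the dashboard-side query (the table name windows_10_aggregate_v1 and the submission_date column are hypothetical here):

# Sketch of the dashboard-side aggregation over per-day rows.
# The table name windows_10_aggregate_v1 is hypothetical.
DASHBOARD_QUERY = """
SELECT
  build_group,
  normalized_channel,
  SUM(count) AS count
FROM windows_10_aggregate_v1
WHERE submission_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY 1, 2
"""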

Update: I have a draft PR which covers one of the two views that build up this dashboard: https://github.com/mozilla/bigquery-etl/pull/441

It's currently kinda broken but is a start. Another project (porting the mission control v2 dashboard to docker / airflow) is taking most of my cycles, but I intend to return to this soon.

Blocks: 1594761
No longer blocks: data-migration-gcp-query

Wound up converting the Hive views to BigQuery views against clients_daily (see above PR).

New dashboard is here: https://sql.telemetry.mozilla.org/dashboard/windows-10-client-distributions

The old dashboard now redirects to the new one. All old queries have been unscheduled (the new ones update every 2 days by default).

I only ported one of the patch queries in the above, but it's trivial to do more as required. If there are any issues, please file new bugs!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Datasets: General → General