Closed Bug 1399532 Opened 7 years ago Closed 5 years ago

Windows 10 Client Distribution Dashboard needs to be rewritten for bigquery

Categories

(Data Platform and Tools :: General, enhancement, P2)

x86_64
Linux
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harter, Assigned: wlach)

References

Details

The Windows 10 Client Dashboard [1] is actively used by QA/Release engineering to monitor uptake of new Windows releases.

This dashboard leans heavily on virtual tables to share logic between queries, which makes it very difficult to maintain: the upstream queries that create the virtual tables cannot be collaborated on, and the SQL is not version controlled.

Instead, we should have a derived dataset generated in python_mozetl with a few simple queries feeding the dashboard.

[1] https://sql.telemetry.mozilla.org/dashboard/windows-10-user-strata
Will, do you have any interest in working on this?
Flags: needinfo?(wlachance)
Drive-by comment: I would rather have versioned queries, and the capabilities that come with that, than have to create tedious ETL jobs for things like views. If we followed through with bug 1274990 (which really is not very difficult), we could have a separate repo for scheduled queries.
What advantage does creating a new Airflow operator have over keeping this logic in python_mozetl? Airflow has proven to be difficult to test. I could see a simple Python wrapper for executing a query and saving the result to S3.

I guess the cost is booting up a new cluster for simple ETL?
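
To be concrete, the wrapper I have in mind is tiny. A rough sketch (the Presto host, bucket, and key are placeholders, and pyhive/boto3 are just one plausible choice of libraries, not a decided design):

# Hypothetical sketch: run a Presto query and write the result to S3 as CSV.
# The host, bucket, and key below are placeholders.
import csv
import io

import boto3
from pyhive import presto


def run_query_to_s3(sql, bucket, key):
    cursor = presto.connect(host="presto.example.internal", port=8080).cursor()
    cursor.execute(sql)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())


run_query_to_s3(
    "SELECT * FROM build_distribution",
    "example-output-bucket",
    "windows10/build_distribution.csv",
)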
Flags: needinfo?(fbertsch)
(In reply to Ryan Harter [:harter] from comment #3)
> What advantage does creating a new Airflow operator have over keeping this
> logic in python_mozetl? Airflow has proven to be difficult to test. I could
> see a simple python wrapper for executing a query and saving the result to
> S3. 
> 
> I guess the cost is booting up a new cluster for simple ETL?

I suppose in reality these are separate asks:
1. If we use Presto (i.e. a new Airflow operator), we don't get materialized views; the view query would need to run every time a query that uses it runs.
2. If we use an ETL job, it materializes that data, which will be faster for the queries that use it. Downsides are more testing and maintaining of YAETL (yet-another-ETL ;)).

SQL queries are meant to be declarative and simpler. I would much appreciate the ability to schedule and use them in Airflow; this bug just seemed like a good vehicle to do so :)
Flags: needinfo?(fbertsch)
(In reply to Ryan Harter [:harter] from comment #1)
> Will, do you have any interest in working on this?

Sorry for the late reply, yes I am interested. I defer to you and Frank on the best way of implementing this; the python-mozetl approach would probably be a lot easier for me to implement, but I'm happy to work on the Airflow operator as well if it seems to make more sense.
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> (In reply to Ryan Harter [:harter] from comment #1)
> > Will, do you have any interest in working on this?
> 
> Sorry for the late reply, yes I am interested. 

Awesome!

> I defer to you and Frank on
> the best way of implementing this; the python-mozetl approach would probably
> be a lot easier for me to implement, but I'm happy to work on the Airflow
> operator as well if it seems to make more sense.

I think python-mozetl is a better fit for this task since it would be better if these datasets were materialized. It currently takes ~5-10m to run a query, which makes it difficult to debug problems.

As noted in Bug 1393830, this isn't an immediate need. It has, however, been a maintenance burden, so I'd like to make progress while we can.

I'll set up a meeting next week to discuss.
Assignee: nobody → wlachance
FYI, I am going to be requesting an update to the dashboard within the next couple of weeks once the Fall Creators Update is finalized. Not saying this needs to change any immediate plans, just a heads-up.
Component: General → Datasets: General
If we do end up scheduling tasks via Airflow that re-run the query and replace all the data, we should be sure to set the DAG's "catchup" parameter to false. That seems to mean, in effect, that the latest run is expected to produce all the output that's needed (rather than one incremental task for each day/period/run).

See:
https://airflow.apache.org/scheduler.html#backfill-and-catchup
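
Concretely, it's just an argument on the DAG definition. A sketch (the dag_id, start_date, and schedule here are made up, not from our actual DAGs):

# Hypothetical sketch: catchup=False on the DAG definition.
# dag_id, start_date, and schedule_interval are placeholders.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="windows_10_dashboard",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,  # run only the latest period; don't backfill one run per day
)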

Harter, is this bug still relevant?

Assignee: wlachance → nobody
Flags: needinfo?(rharter)

My understanding is that the dashboard's future is up in the air at the moment (though FWIW, it's still one I use from time to time, including recently while tracking patch uptake for bug 1585109).

The dashboard will be broken with the deprecation of longitudinal. RyanVM is still interested in this dashboard. We can keep this bug to cover the work needed to rebuild the dashboard or we can file a new bug. FWIW, it sounds like we don't have resources to handle the rebuild on the DS team.

Flags: needinfo?(rharter)

OK. It turns out this dashboard is generated from datasets defined by two Hive metastore queries against longitudinal (thanks :jason for providing the references):

CREATE VIEW hive.default.windows_10_aggregate AS
WITH filtered AS (
   SELECT *
   FROM longitudinal
   WHERE (system_os[1].name = 'Windows_NT')
     AND (system_os[1].version LIKE '10%')
     AND (TRY_CAST("split_part"(build[1].version, '.', 1) AS bigint) >= 47)
     AND ("date_diff"('day', "date_parse"(submission_date[1], '%Y-%m-%dT00:00:00Z'), current_date) < 28)
), aggregated AS (
   SELECT
     system_os[1].name windows_name,
     system_os[1].version windows_version,
     system_os[1].windows_build_number build_number,
     TRY_CAST(system_os[1].windows_ubr AS varchar) windows_ubr,
     IF((system_os[1].windows_build_number <= 10240), '1507',
       IF((system_os[1].windows_build_number <= 10586), '1511',
         IF((system_os[1].windows_build_number <= 14393), '1607',
           IF((system_os[1].windows_build_number <= 15063), '1703',
             IF((system_os[1].windows_build_number <= 16299), '1709',
               IF((system_os[1].windows_build_number <= 17134), '1803',
                 IF((system_os[1].windows_build_number <= 17763), '1809',
                   IF((system_os[1].windows_build_number <= 18362), '1903',
                     IF((system_os[1].windows_build_number > 18362), 'Insider', null))))))))) build_group,
     TRY_CAST("split_part"(build[1].version, '.', 1) AS bigint) ff_build_version,
     normalized_channel,
     "count"(1) count
   FROM filtered
   GROUP BY 1, 2, 3, 4, 5, 6, 7
   ORDER BY count DESC
)
SELECT *
FROM (aggregated
  CROSS JOIN (
    SELECT "sum"(count) total_obs
    FROM aggregated
  ) total (total_obs))

CREATE VIEW hive.default.build_distribution AS
WITH channel_summary AS (
   SELECT
     normalized_channel,
     build_group,
     "max"(build_number) oo,
     "sum"(count) frequency
   FROM windows_10_aggregate
   GROUP BY 1, 2
   ORDER BY 2 ASC
), counts AS (
   SELECT
     normalized_channel,
     "sum"(frequency) total
   FROM channel_summary
   GROUP BY 1
)
SELECT
   cs.normalized_channel,
   build_group,
   (CAST(frequency AS double) / total) frequency,
   frequency count
FROM (channel_summary cs
  INNER JOIN counts cc ON (cs.normalized_channel = cc.normalized_channel))
ORDER BY cs.oo ASC

We should be able to convert these into bigquery-etl jobs scheduled with Airflow easily enough: we can replace the use of longitudinal with a sample of either main_summary or telemetry.main.
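
To give a flavor of the conversion, the filter/aggregate half of the first view might translate roughly like this (a sketch only; field names like normalized_os and environment.system.os.windows_build_number are from memory and need verifying against the actual schema, and sample_id = 42 just stands in for "a sample"):

# Rough sketch of the converted query, run via the BigQuery client library.
# All field names and the sampling clause are assumptions to verify.
from google.cloud import bigquery

QUERY = """
SELECT
  normalized_os_version AS windows_version,
  environment.system.os.windows_build_number AS build_number,
  normalized_channel,
  COUNT(*) AS count
FROM telemetry.main
WHERE normalized_os = 'Windows'
  AND normalized_os_version LIKE '10%'
  AND sample_id = 42
  AND DATE(submission_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY 1, 2, 3
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(dict(row))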

I don't feel like we should change anything else about this dashboard at this time: certainly not moving any logic to python_mozetl, which is going away. :)

Summary: Replace most of the Windows 10 Client Distribution Dashboard's logic → Windows 10 Client Distribution Dashboard needs to be rewritten for bigquery

I am going to attempt to do this.

Assignee: nobody → wlachance

(In reply to Mark Reid [:mreid] from comment #8)

> If we do end up scheduling tasks via Airflow that re-run the query and
> replace all the data, we should be sure to set the DAG's "catchup"
> parameter to false. That seems to mean, in effect, that the latest run is
> expected to produce all the output that's needed (rather than one
> incremental task for each day/period/run).

Let's just use the bigquery-etl model of creating per-day views. We can always aggregate inside the dashboard itself. It should be relatively fast.
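
In other words, each daily run writes rows tagged with its submission_date, and the dashboard query just sums over the trailing window. A sketch of the dashboard-side query (the table name windows_10_aggregate_v1 and the submission_date column are hypothetical here):

# Sketch of the dashboard-side aggregation over per-day rows.
# The table name windows_10_aggregate_v1 is hypothetical.
DASHBOARD_QUERY = """
SELECT
  build_group,
  normalized_channel,
  SUM(count) AS count
FROM windows_10_aggregate_v1
WHERE submission_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY 1, 2
"""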

Update: I have a draft PR which covers one of the two views that build up this dashboard: https://github.com/mozilla/bigquery-etl/pull/441

It's currently kinda broken but is a start. Another project (porting the mission control v2 dashboard to docker / airflow) is taking most of my cycles, but I intend to return to this soon.

Blocks: 1594761
No longer blocks: data-migration-gcp-query

Wound up converting the Hive views to BigQuery views against clients_daily (see above PR).

New dashboard is here: https://sql.telemetry.mozilla.org/dashboard/windows-10-client-distributions

The old dashboard now redirects to the new one. All old queries have been unscheduled (the new ones update every 2 days by default).

I only ported one of the patch queries in the above, but it's trivial to do more as required. If there are any issues, please file new bugs!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Datasets: General → General