Replace most of the Windows 10 Client Distribution Dashboard's logic

Status: NEW
Assigned to: wlach
Priority: P2
Severity: normal
Reported: a year ago
Modified: 11 months ago

People

(Reporter: harter, Assigned: wlach)


(Reporter)

Description

a year ago
The Windows 10 Client Dashboard [1] is actively used by QA/Release engineering to monitor uptake of new Windows releases.

This dashboard relies heavily on virtual tables to share logic between queries, which makes it very difficult to maintain: the upstream queries that create the virtual tables cannot be collaborated on, and their SQL is not version controlled.

Instead, we should have a derived dataset generated in python_mozetl with a few simple queries feeding the dashboard.

[1] https://sql.telemetry.mozilla.org/dashboard/windows-10-user-strata
(Reporter)

Comment 1

a year ago
Will, do you have any interest in working on this?
Flags: needinfo?(wlachance)
Drive-by comment: I would rather have versioned queries, and the capabilities that come with that, than have to create tedious ETL jobs for things like views. If we followed through with bug 1274990 (which really is not very difficult), we could have a separate repo for scheduled queries.
(Reporter)

Comment 3

a year ago
What advantage does creating a new Airflow operator have over keeping this logic in python_mozetl? Airflow has proven to be difficult to test. I could see a simple python wrapper for executing a query and saving the result to S3. 

I guess the cost is booting up a new cluster for simple ETL?
Flags: needinfo?(fbertsch)
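The "simple python wrapper" floated in comment 3 could be sketched roughly as below. Everything here is hypothetical: `query_to_storage` and the injected `execute`/`write` callables are made-up names, and a real version would point `write` at S3 (e.g. via boto3's `put_object`) instead of an in-memory dict.

```python
import csv
import io

def query_to_storage(sql, execute, write, key):
    """Run `sql` via `execute`, serialize the rows to CSV, and hand the
    result to `write` under `key`. Returns the number of rows written."""
    rows = execute(sql)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    write(key, buf.getvalue())
    return len(rows)

# In-memory stand-ins so the sketch runs without AWS credentials; in
# production `write` would upload to S3 instead of filling a dict.
store = {}
fake_execute = lambda sql: [("20170801", 123), ("20170802", 456)]

n = query_to_storage("SELECT build_date, count(*) FROM pings GROUP BY 1",
                     fake_execute, store.__setitem__, "windows10/uptake.csv")
```

Injecting the executor and writer keeps the wrapper trivially testable, which sidesteps the "Airflow is difficult to test" problem for the query logic itself.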
Comment 4

(In reply to Ryan Harter [:harter] from comment #3)
> What advantage does creating a new Airflow operator have over keeping this
> logic in python_mozetl? Airflow has proven to be difficult to test. I could
> see a simple python wrapper for executing a query and saving the result to
> S3. 
> 
> I guess the cost is booting up a new cluster for simple ETL?

I suppose in reality these are separate asks:
1. If we use Presto (i.e. a new Airflow operator), it doesn't support materialized views; the view's defining query would need to re-run every time a query that uses it runs.
2. If we use an ETL job, it materializes that data, which will be faster for the queries that use it. Downsides are more testing and maintaining of YAETL (yet-another-ETL ;)).

SQL queries are meant to be declarative and simpler. I would much appreciate the ability to schedule and use them in Airflow; this bug just seemed like a good vehicle to do so :)
Flags: needinfo?(fbertsch)
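The tradeoff between points 1 and 2 can be illustrated with a small SQLite sketch (SQLite standing in for Presto here; the `pings`/`uptake` table and column names are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pings (build TEXT, n INTEGER)")
con.executemany("INSERT INTO pings VALUES (?, ?)", [("rs2", 10), ("rs3", 5)])

# A plain view (what Presto offers): the defining query re-runs on every use.
con.execute("CREATE VIEW uptake AS "
            "SELECT build, SUM(n) AS total FROM pings GROUP BY build")
# A materialized table (what an ETL job would produce): computed once,
# cheap to read, but stale until the job re-runs.
con.execute("CREATE TABLE uptake_mat AS "
            "SELECT build, SUM(n) AS total FROM pings GROUP BY build")

con.execute("INSERT INTO pings VALUES ('rs3', 7)")

view_total = con.execute("SELECT total FROM uptake WHERE build='rs3'").fetchone()[0]
mat_total = con.execute("SELECT total FROM uptake_mat WHERE build='rs3'").fetchone()[0]
# view_total is 12 (recomputed from pings); mat_total is still 5.
```

The view is always fresh but pays the full query cost every time; the materialized table is fast but needs a scheduled job to stay current, which is exactly the testing/maintenance cost described above.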
Comment 5

(In reply to Ryan Harter [:harter] from comment #1)
> Will, do you have any interest in working on this?

Sorry for the late reply, yes I am interested. I defer to you and Frank on the best way of implementing this, the python-mozetl approach would probably be a lot easier for me to implement but am happy to work on the airflow operator as well if it seems to make more sense.
Flags: needinfo?(wlachance)
(Reporter)

Comment 6

a year ago
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> (In reply to Ryan Harter [:harter] from comment #1)
> > Will, do you have any interest in working on this?
> 
> Sorry for the late reply, yes I am interested. 

Awesome!

> I defer to you and Frank on
> the best way of implementing this, the python-mozetl approach would probably
> be a lot easier for me to implement but am happy to work on the airflow
> operator as well if it seems to make more sense.

I think python-mozetl is a better fit for this task since it would be better if these datasets were materialized. It currently takes ~5-10 minutes to run a query, which makes problems difficult to debug.

As noted in Bug 1393830, this isn't an immediate need. This has, however, been a maintenance burden so I'd like to make progress while we can.

I'll set up a meeting next week to discuss.
(Reporter)

Updated

a year ago
Assignee: nobody → wlachance
FYI, I am going to be requesting an update to the dashboard within the next couple of weeks once the Fall Creators Update is finalized. Not saying this needs to change any immediate plans, just a heads-up.

Updated

11 months ago
Component: General → Datasets: General

Comment 8

11 months ago
If we do end up scheduling tasks via Airflow that re-run the query and replace all the data, we should be sure to set the task's "catchup" parameter to false. That seems to effectively mean that the latest run is expected to produce all the output that's needed (rather than one incremental task for each day/period/run).

See:
https://airflow.apache.org/scheduler.html#backfill-and-catchup
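A toy model of the behaviour described above (this is not Airflow's actual scheduler code, just an illustration of what catchup=False implies for a daily DAG):

```python
from datetime import date, timedelta

def scheduled_runs(start, today, catchup):
    """Which execution dates a daily DAG would get: with catchup=True the
    scheduler backfills one run per missed interval since `start`; with
    catchup=False it creates only the most recent interval's run."""
    intervals = [start + timedelta(days=d) for d in range((today - start).days)]
    return intervals if catchup else intervals[-1:]

backfilled = scheduled_runs(date(2017, 8, 1), date(2017, 8, 5), catchup=True)
latest_only = scheduled_runs(date(2017, 8, 1), date(2017, 8, 5), catchup=False)
```

With catchup off, only the latest interval ever runs, so that single run must regenerate the full output rather than one day's increment.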