Closed Bug 1251189 Opened 8 years ago Closed 8 years ago

Build Spark Job to export CSV summary data for the fennec-dashboard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gfritzsche, Assigned: Dexter)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [measurement:client])

To power the fennec-dashboard, we need to build CSV data exports from the "core" ping, following this format:
https://metrics.services.mozilla.com/fennec-dashboard/data/fennec_weekly_data.csv
https://metrics.services.mozilla.com/fennec-dashboard/data/fennec_monthly_data.csv

This currently contains these columns:
os_version,geo,channel,date,actives,abnormals,new_records,d1,d7,d30,hours,google,yahoo,bing,other

abnormals will be cut and search counts will not be available (at least initially), so depending on the plans we can either drop those columns or fill them with 0s.
The exports can go into: s3://net-mozaws-prod-metrics-data/fennec-dashboard

To keep the convention established by the Desktop v4 dashboard update, we should name them:
fennec-v4-weekly.csv
fennec-v4-monthly.csv
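
As a rough illustration of the two options above (zero-filling versus dropping the unavailable columns), here is a minimal sketch using only Python's standard csv module. The column order comes from the existing exports; the function name write_summary, the keep_unavailable flag and the row dictionaries are hypothetical and are not taken from the actual job.

```python
import csv
import io

# Column order taken from the existing fennec-dashboard exports.
COLUMNS = ["os_version", "geo", "channel", "date", "actives", "abnormals",
           "new_records", "d1", "d7", "d30", "hours",
           "google", "yahoo", "bing", "other"]

# Columns that cannot be filled from the "core" ping (at least initially).
UNAVAILABLE = {"abnormals", "google", "yahoo", "bing", "other"}


def write_summary(rows, out, keep_unavailable=True):
    """Write summary rows (dicts) as CSV to the file-like object `out`.

    Hypothetical helper: unavailable columns are either written as 0
    (keep_unavailable=True) or left out of the file entirely.
    """
    columns = [c for c in COLUMNS if keep_unavailable or c not in UNAVAILABLE]
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({c: 0 if c in UNAVAILABLE else row[c] for c in columns})


if __name__ == "__main__":
    # Made-up example row, only to show the output shape.
    example = [{"os_version": "23", "geo": "US", "channel": "beta",
                "date": "2016-03-06", "actives": 10, "new_records": 2,
                "d1": 1, "d7": 1, "d30": 0, "hours": 3.5}]
    buf = io.StringIO()
    write_summary(example, buf)
    print(buf.getvalue())
```

Zero-filling keeps the file layout identical to the current exports, so a dashboard that reads columns by position would not need to change; dropping the columns only works if the dashboard is updated as well.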
Priority: -- → P2
Whiteboard: [measurement:client]
Depends on: 1253392
Priority: P2 → P1
Assignee: nobody → alessio.placitelli
Blocks: 1251192
Hamilton, what do you think about storing the Spark script used to generate the CSV data in the dashboard repository?

[1] - https://mail.mozilla.org/pipermail/fhr-dev/2016-March/000884.html
Flags: needinfo?(hulmer)
Talking to mreid, we decided to let this live in the pipeline repository for now:
* repo: https://github.com/mozilla-services/data-pipeline/
* path: reports/fennec_dashboard 

That way we can find it easily in case we make any bigger changes.
In the medium to long term we'd want to move away from this Spark job and power this from a longitudinal, client-oriented or other more appropriate derived stream.
Flags: needinfo?(hulmer)
We will also need to support 3 modes of operation here:
* weekly & monthly for incremental updates of the csv files
* backfill for the whole time period we are looking at

Ideally we'd want to power that from the same notebook just by looking at the submission arguments or the job name.

Roberto, do you have an idea on how we can do that properly?
Can we see the "Spark submission args" there?
Or maybe get the job name and look for a "-weekly"/"-monthly" suffix?
Flags: needinfo?(rvitillo)
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)

> Roberto, do you have an idea on how we can do that properly?
> Can we see the "Spark submission args" there?
> Or maybe get the job name and look for a "-weekly"/"-monthly" suffix?

The job name suffix will work, but it's a hack. I filed bug 1258685.
Flags: needinfo?(rvitillo)
Roberto, any suggestion about how to fetch the job name from a Spark notebook?
Flags: needinfo?(rvitillo)
You could try to read the filename of the notebook (e.g. YOURJOB.ipynb) from the current working directory.
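
A minimal sketch of that workaround, assuming the scheduled notebook is the only .ipynb file in the current working directory and that the job name ends in a "-weekly" or "-monthly" suffix as proposed above. The "-backfill" suffix and the function names mode_from_name and detect_mode are hypothetical, added only for illustration.

```python
import glob
import os

# "-weekly" and "-monthly" come from the discussion above;
# "-backfill" is an assumed name for the third mode.
MODES = ("weekly", "monthly", "backfill")


def mode_from_name(name):
    """Return the mode encoded as a "-<mode>" suffix of a job or notebook name.

    Returns None if the name carries no known suffix.
    """
    base = os.path.basename(name)
    if base.endswith(".ipynb"):
        base = base[:-len(".ipynb")]
    for mode in MODES:
        if base.endswith("-" + mode):
            return mode
    return None


def detect_mode(directory="."):
    """Derive the mode from the notebook file(s) found in `directory`."""
    notebooks = glob.glob(os.path.join(directory, "*.ipynb"))
    modes = {mode_from_name(nb) for nb in notebooks} - {None}
    if len(modes) != 1:
        # Fail loudly rather than guess which export to produce.
        raise RuntimeError("cannot determine mode from notebooks: %r" % notebooks)
    return modes.pop()
```

Calling detect_mode() at the top of the notebook would let a single notebook serve all three jobs, at the cost of depending on how the job is named, which is why this remains a hack until submission arguments are exposed.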
Flags: needinfo?(rvitillo)
I checked whether the active users computed by the script in comment 6 for the week starting on the 6th of March ("beta" population) roughly match the ones from this query: https://sql.telemetry.mozilla.org/queries/85/source#table . They do, so the Spark job should be producing sane data.
Status: NEW → ASSIGNED
Blocks: 1259505
Blocks: 1260715
This was merged:
https://github.com/mozilla-services/data-pipeline/commit/ddd255e8b2c5440ad94819fcea88678f894bcce3

We can't power the fennec-dashboard from this yet due to bug 1257589. We will look into scheduling this for Fennec 46 in bug 1260715.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
No longer blocks: 1259505
Product: Cloud Services → Cloud Services Graveyard