Port executive report to use main_summary dataset

Status: RESOLVED FIXED
Product: Data Platform and Tools
Component: Datasets: General
Priority: P1
Severity: normal
Opened: a year ago
Last Resolved: 5 months ago

People: (Reporter: mreid, Assigned: amiyaguchi)
Tracking: (Blocks: 1 bug)
Attachments: (2 attachments)

(Reporter)

Description

a year ago
Port code at [1] to use the main_summary dataset. We may not need "inactives" or "five_of_seven" columns initially, as they are expensive to compute and not used for anything at the moment.
(Reporter)

Comment 1

a year ago
This would (I believe) let us get rid of the redshift cluster as well as the scheduled job that populates it.
Blocks: 1255755
(Reporter)

Comment 2

a year ago
This job will also need to consume the crash data, but it can do that using get_pings + a join
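
The "get_pings + a join" idea can be sketched outside Spark as a plain key-based join: build a lookup of crash counts by a shared key and attach it to each summary row. This is an illustrative sketch only; the function, field names, and the (date, channel) key are hypothetical, not taken from the actual job.

```python
# Hypothetical sketch: joining crash counts onto summary rows by a shared
# (date, channel) key, mirroring what a get_pings + join would do in Spark.
def join_crashes(summary_rows, crash_counts):
    """Attach a crash count to each summary row, defaulting to 0."""
    counts = {(c["date"], c["channel"]): c["crashes"] for c in crash_counts}
    return [
        {**row, "crashes": counts.get((row["date"], row["channel"]), 0)}
        for row in summary_rows
    ]
```

In Spark this would be a left outer join with a zero fill, so rows with no matching crash pings still appear in the output.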

Updated

a year ago
Points: --- → 3
Priority: -- → P3
(Reporter)

Updated

a year ago
Blocks: 1309633
(Assignee)

Updated

a year ago
Assignee: nobody → amiyaguchi
(Assignee)

Comment 3

a year ago
I talked to Mark about this bug (and a few others). I'll be picking this up.
(Assignee)

Updated

a year ago
Priority: P3 → P1
(Assignee)

Comment 4

a year ago
Created attachment 8815931 [details]
topline_summary.ipynb

Attached is a pyspark notebook that outlines the general approach to porting the executive/topline summary. 

This notebook takes ~20 minutes with a day of data and ~2:20 with a week of data on a 5-machine cluster. It is prohibitively slow with any more data and would probably take over 10 hours to complete. For reference, the original script takes about 4 hours to run on the redshift cluster.

:mreid and I suspect that most of the time is being spent in user-defined functions. A benchmark comparing regexes in Python and Java [1] suggests the performance difference would be very significant on a large dataset (say, 30 days' worth of main_summary data). Mark has also mentioned poor performance with Python's date string conversions in the past.

I will be rewriting this notebook in Scala, which will hopefully improve performance. Most of the notebook has been ported aside from the collected search count numbers and some tests. Once that is done, I can start validating that the numbers look right.


[1] https://benchmarksgame.alioth.debian.org/u64q/python.html
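
One concrete instance of the per-row UDF cost mentioned above is date string conversion. A minimal sketch, assuming the dates are fixed-width `yyyymmdd` strings as in main_summary's `submission_date`: general-purpose `strptime` parsing pays for format interpretation on every call, while slicing a known fixed-width string avoids the datetime machinery entirely (and in Spark, a built-in expression would avoid Python per-row calls altogether).

```python
from datetime import datetime

def iso_date_strptime(s):
    # General-purpose parsing: flexible, but comparatively slow per call.
    return datetime.strptime(s, "%Y%m%d").strftime("%Y-%m-%d")

def iso_date_slice(s):
    # Cheap slicing for a known fixed-width format: no datetime machinery.
    return s[0:4] + "-" + s[4:6] + "-" + s[6:8]
```

Both produce the same output for well-formed input; the slicing version trades validation for speed, which is a reasonable trade only when the upstream data is trusted.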
(Assignee)

Comment 5

a year ago
For easier viewing, here's a link via gist: https://gist.github.com/acmiyaguchi/503bfcccad19afe87bc9579e6e08bb9c
(Assignee)

Updated

a year ago
Priority: P1 → P2
(Assignee)

Comment 6

a year ago
I've hit a roadblock trying to run my implementation of the ToplineSummary. I've narrowed it down to a single function that is hard to unit test [1]. I know it's failing here because I've forced the dataframe to collect through a call to .count() and watched it fail in the Spark UI.

The relevant stacktrace suggests it is probably failing on a null attribute somewhere.

> Caused by: java.util.NoSuchElementException: None.get
>	at scala.None$.get(Option.scala:347)
>	at scala.None$.get(Option.scala:345)
>	at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>	at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>	at java.lang.Thread.run(Thread.java:745)

I think that this might be caused by the mapping of `messageToRow` over the `Dataset('telemetry')` rdd. My implementation is similar to the implementation in the MainSummaryView [2].

Any ideas on how to prod at this problem?

[1] https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/ToplineSummary.scala#L161-L199
[2] https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L208
Flags: needinfo?(mreid)
(Assignee)

Comment 7

11 months ago
It looks like this issue has something to do with how the Spark session is accessed [1]. My bug follows the same pattern as the reported issue. I've implemented a proper singleton, which fixes the failure above.

[1] https://issues.apache.org/jira/browse/SPARK-16599
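
The "proper singleton" fix can be illustrated with a minimal lazily-initialized holder (a Python sketch for readability; the actual fix was in Scala, where the idiomatic equivalent is reusing one session via `SparkSession.builder.getOrCreate()` rather than constructing fresh state that may not exist where the code runs). The class and method names here are illustrative, not from the patch.

```python
class SessionSingleton:
    # Illustrative stand-in for a session holder: the first call creates
    # the instance via the supplied factory; later calls reuse the same
    # instance instead of re-resolving state that may be absent.
    _instance = None

    @classmethod
    def get_or_create(cls, factory):
        if cls._instance is None:
            cls._instance = factory()
        return cls._instance
```

The key property is that every caller observes the same instance, so no code path ends up reaching for a session that was never initialized in its context.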
Flags: needinfo?(mreid)
(Assignee)

Updated

11 months ago
Priority: P2 → P1

Comment 8

11 months ago
Created attachment 8824193 [details] [review]
[telemetry-batch-view] acmiyaguchi:topline-report > mozilla:master
(Assignee)

Updated

11 months ago
Depends on: 1329842
(Assignee)

Updated

11 months ago
Depends on: 1329844
(Reporter)

Updated

10 months ago
Blocks: 1320702
(Reporter)

Updated

8 months ago
Blocks: 1352443
(Assignee)

Comment 9

8 months ago
This bug is getting bumped up in light of a day-long outage this past week. Work will be tracked in bug 1329844.

The ToplineSummary is currently in the review process, but won't be deployable as-is: it generates data that is too granular for the dashboard. A new job will be added to python_etl that performs the same role as reformat_v4.py: it computes 'ALL' rows and aggregates smaller geos into 'Rest of World'. It also uploads the resulting dataframe to the dashboard buckets. These two jobs will be scheduled on Airflow.

I'd like this to run in parallel with the existing job for at least 2 weekly cycles to make sure it is performing correctly before swapping over.
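
The two aggregations described above can be sketched as a small rollup: collapse any geo outside a whitelist into 'Rest of World', and sum every geo into an 'ALL' row. This is a hypothetical sketch only; the function name, the metric field, and the whitelist are illustrative and not taken from reformat_v4.py.

```python
# Illustrative whitelist of geos kept as-is; everything else is folded
# into "Rest of World". Not the actual list used by the dashboard.
TOP_GEOS = {"US", "DE", "BR"}

def rollup(rows):
    """Sum a metric per geo, collapsing non-whitelisted geos into
    'Rest of World' and also accumulating an overall 'ALL' total."""
    totals = {}
    for row in rows:
        geo = row["geo"] if row["geo"] in TOP_GEOS else "Rest of World"
        for key in (geo, "ALL"):
            totals[key] = totals.get(key, 0) + row["hours"]
    return totals
```

Running both the granular output and this rolled-up output side by side for a couple of weekly cycles, as proposed, makes it easy to diff the 'ALL' and 'Rest of World' rows against the existing job.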
(Assignee)

Updated

8 months ago
Depends on: 1357875
(Assignee)

Updated

8 months ago
No longer depends on: 1357875

Updated

8 months ago
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
(Reporter)

Comment 10

5 months ago
This work has been completed in the context of python_mozetl at
https://github.com/mozilla/python_mozetl/tree/master/mozetl/topline
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED