Port code at  to use the main_summary dataset. We may not need "inactives" or "five_of_seven" columns initially, as they are expensive to compute and not used for anything at the moment.
This would (I believe) let us get rid of the redshift cluster as well as the scheduled job that populates it.
This job will also need to consume the crash data, but it can do that using get_pings plus a join.
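As a hedged illustration of that join step, here is a pure-python sketch of keying crash counts against usage aggregates. The field names (`geo`, `channel`, `hours`) and the left-join behavior are assumptions for illustration; the real job would pull crash pings with get_pings and join DataFrames in spark.

```python
# Hypothetical sketch of joining crash counts into usage aggregates.
# Field names and the (geo, channel) key are illustrative assumptions.
from collections import Counter

def aggregate_crashes(crash_pings):
    """Count crash pings per (geo, channel) key."""
    counts = Counter()
    for ping in crash_pings:
        counts[(ping["geo"], ping["channel"])] += 1
    return counts

def join_crashes(usage_rows, crash_counts):
    """Attach a crash count to each usage row (left join; default 0)."""
    return [
        dict(row, crashes=crash_counts.get((row["geo"], row["channel"]), 0))
        for row in usage_rows
    ]

usage = [{"geo": "US", "channel": "release", "hours": 12.5}]
crashes = aggregate_crashes([
    {"geo": "US", "channel": "release"},
    {"geo": "US", "channel": "release"},
])
joined = join_crashes(usage, crashes)
# joined[0]["crashes"] == 2
```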
I talked to Mark about this bug (and a few others). I'll be picking this up.
Created attachment 8815931 [details]
topline_summary.ipynb

Attached is a pyspark notebook that outlines the general approach to porting the executive/topline summary. This notebook takes ~20 minutes with a day of data and ~2:20 with a week of data on a 5-machine cluster. It is prohibitively slow with any more data, and would probably take over 10 hours to complete. For reference, the original script takes about 4 hours to run on the redshift cluster.

:mreid and I suspect that most of the time is being spent in user-defined functions. A benchmark comparing regexes between python and java shows a performance difference that would be very significant on a large dataset (say 30 days worth of main_summary data). Mark has also mentioned poor performance with python's date string conversions in the past. I will be rewriting this notebook in scala, which will hopefully improve performance.

Most of the notebook has been ported aside from collected search count numbers and some tests. Once this is done I can start validating that the numbers look right.

https://benchmarksgame.alioth.debian.org/u64q/python.html
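To make the UDF suspicion concrete, here is a minimal micro-benchmark sketch of the two suspected costs: regex matching and date-string parsing. The pattern, the sample inputs, and the `%Y%m%d` format are illustrative assumptions, not the actual job's logic.

```python
# Sketch of a python-side micro-benchmark for the suspected UDF costs.
# The pattern and date format are assumptions for illustration only.
import re
import timeit
from datetime import datetime

SEARCH_RE = re.compile(r"^[a-z]+\.in-content")

def match_source(source):
    """Return True when the search source matches the sample pattern."""
    return SEARCH_RE.match(source) is not None

def parse_date(s):
    """Parse a yyyymmdd-style submission date string."""
    return datetime.strptime(s, "%Y%m%d")

regex_cost = timeit.timeit(lambda: match_source("abouthome.in-content"), number=10000)
parse_cost = timeit.timeit(lambda: parse_date("20161215"), number=10000)
```

Per-call costs like these are negligible in isolation but multiply across every row a python UDF touches, which is consistent with the slowdown observed at 30 days of main_summary data.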
For easier viewing, here's a link via gist: https://gist.github.com/acmiyaguchi/503bfcccad19afe87bc9579e6e08bb9c
I've hit a roadblock trying to run my implementation of the ToplineSummary. I've narrowed it down to a single function that is hard to unit test. I know it's failing here because I've forced the dataframe to collect through a call to .count() and watched it fail in the spark UI. The relevant stacktrace shows that it is probably failing on a null attribute somewhere.

> Caused by: java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)

I think this might be caused by the mapping of `messageToRow` over the `Dataset('telemetry')` rdd. My implementation is similar to the one in MainSummaryView. Any ideas on how to prod at this problem?

https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/ToplineSummary.scala#L161-L199
https://github.com/acmiyaguchi/telemetry-batch-view/blob/topline-report/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L208
It looks like this issue has something to do with how the spark session is accessed. My bug follows the same pattern as the issue reproduced in SPARK-16599. I've implemented a proper singleton, which fixes the issues above.

https://issues.apache.org/jira/browse/SPARK-16599
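For reference, the shape of the fix is the getOrCreate-style singleton below. `FakeSession` is a stand-in so this sketch runs without a spark install; the actual fix holds a single shared session (e.g. via SparkSession.builder.getOrCreate() or a lazily initialized field in the scala job) instead of constructing one per task.

```python
# Sketch of the singleton pattern behind the fix. FakeSession is a
# stand-in for SparkSession so the example is self-contained.

class FakeSession:
    """Stand-in for a heavyweight session object."""
    pass

_session = None

def get_or_create_session():
    """Return the shared session, creating it on first use."""
    global _session
    if _session is None:
        _session = FakeSession()
    return _session

a = get_or_create_session()
b = get_or_create_session()
# a is b: every caller sees the same instance, so tasks never race to
# construct (or partially observe) a second session.
```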
Created attachment 8824193 [details] [review] [telemetry-batch-view] acmiyaguchi:topline-report > mozilla:master
This bug is getting bumped up in light of a day-long outage this past week. Work will be tracked in bug 1329844.

The ToplineSummary is currently in the review process, but won't be deployable as-is, since it generates data that is too granular for the dashboard. A new job will be added to python_etl that performs the same role as reformat_v4.py: it computes 'ALL' rows, aggregates 'Rest of World' in geo, and uploads the resulting dataframe to the dashboard buckets.

These two jobs will be scheduled on airflow. I'd like the new pipeline to run in parallel with the existing job for at least 2 weekly cycles, to make sure it is performing correctly before swapping over.
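A minimal sketch of that reformat step, with assumed column names and an assumed geo whitelist (reformat_v4.py may differ in detail): geos outside the whitelist collapse into 'Rest of World', and an 'ALL' row sums every input row.

```python
# Hedged sketch of the reformat_v4-style rollup described above.
# TOP_GEOS, the column names, and the metric are illustrative assumptions.

TOP_GEOS = {"US", "DE", "FR"}  # assumption: a fixed whitelist of geos

def reformat_rows(rows, metric="hours"):
    """Roll rows up by geo, bucketing non-whitelisted geos together."""
    totals = {}
    for row in rows:
        geo = row["geo"] if row["geo"] in TOP_GEOS else "Rest of World"
        totals[geo] = totals.get(geo, 0) + row[metric]
    # the 'ALL' row sums every input row regardless of geo
    totals["ALL"] = sum(row[metric] for row in rows)
    return totals

rows = [
    {"geo": "US", "hours": 10},
    {"geo": "DE", "hours": 5},
    {"geo": "BR", "hours": 2},
    {"geo": "IN", "hours": 3},
]
summary = reformat_rows(rows)
# summary == {"US": 10, "DE": 5, "Rest of World": 5, "ALL": 20}
```

Running both pipelines side by side for the two weekly cycles would then amount to diffing outputs like `summary` against the existing job's rows.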
This work has been completed in the context of python_mozetl at https://github.com/mozilla/python_mozetl/tree/master/mozetl/topline