Bug 1251398 (Closed): Opened 9 years ago, Closed 9 years ago

Write spark job to generate JSON data for update orphaning dashboard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: spohl, Assigned: spohl)

References

Details

To run our update orphaning dashboard (currently at http://people.mozilla.org/~spohl/update-orphaning-dashboard/) we're planning to create weekly reports on our orphaning state using longitudinal datasets. The output will be a JSON file stored on S3. Once longitudinal datasets are generated automatically every week, this could be scheduled as a weekly Spark job. For now, the intent is to run it once per week against the most recent longitudinal dataset. Note that currently, new datasets aren't created past 2016-02-12 due to bug 1246137.
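For context, a minimal sketch of what this kind of job might look like (this is not the actual notebook, which lives in the gist below). The parquet path, the column names, and the out-of-date check are assumptions made purely for illustration, and `sqlContext` is assumed to be predefined by the notebook environment:

# Sketch only: load one longitudinal snapshot and estimate the share of
# clients whose most recent ping reports an out-of-date application version.
# Path, columns, and predicate below are hypothetical.
dataset_path = "s3://telemetry-parquet/longitudinal/v20160212"  # hypothetical path
frame = sqlContext.read.parquet(dataset_path)

# In the longitudinal layout, per-ping fields are arrays ordered newest
# first, so element 0 is taken as the client's latest state (assumption).
rows = frame.select("client_id", "build").rdd

LATEST_VERSION = "45.0"  # placeholder for the current release version

def is_out_of_date(row):
    builds = row.build or []
    if not builds:
        return False
    return builds[0].application_version != LATEST_VERSION

out_of_date = rows.filter(is_out_of_date).count()
total = rows.count()
print("out-of-date: %d of %d clients (%.2f%%)"
      % (out_of_date, total, 100.0 * out_of_date / max(total, 1)))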
Roberto, would you be able to review the python notebook for this? Any feedback would be greatly appreciated! https://gist.github.com/sapohl/722f9afe28297e2265bd
Flags: needinfo?(rvitillo)
I forgot to mention that the update orphaning dashboard project has a mana page with a description of this project. I just finished updating it with the most recent changes (such as the fact that we'll be using the longitudinal dataset). The mana page is at: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=55312399

Also, I've made a few minor corrections to the python notebook and uploaded it as a new revision to the gist: https://gist.github.com/sapohl/722f9afe28297e2265bd (same URL as before)
Depends on: 1242039
The analysis looks OK, I have a few notes though:
- There are commented-out statements and import statements scattered around the notebook; please clean the notebook up.
- Everything stored in the current working directory of the notebook should be uploaded automatically to S3 (telemetry-public-analysis-2/YOURJOBNAME/data) once your scheduled job completes; please remove the manual upload step once you are done with testing.
Flags: needinfo?(rvitillo)
You might also want to try using Spark's accumulators [1] to make the code more elegant.

[1] https://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka
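A small sketch of the accumulator suggestion (not code from the notebook): tally several categories in a single pass over the data instead of running one filter().count() job per category. `sc` and `records` (an RDD of per-client dicts) are assumed to already exist, and the predicate is hypothetical:

# Count two categories in one pass using accumulators.
up_to_date = sc.accumulator(0)
out_of_date = sc.accumulator(0)

def tally(record):
    # Hypothetical predicate; the real notebook defines what counts as
    # up to date versus orphaned.
    if record.get("is_up_to_date"):
        up_to_date.add(1)
    else:
        out_of_date.add(1)

records.foreach(tally)
print(up_to_date.value, out_of_date.value)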
Thank you for all the feedback, Roberto! New gist is up: https://gist.github.com/sapohl/722f9afe28297e2265bd

Do you happen to know when we're planning to automatically generate weekly longitudinal datasets? And do you know what the format of the URL to these datasets will be? If not, would it make sense to file a bug and block on it before scheduling the notebook here as a weekly Spark job?

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> There are commented out statements and import statements scattered around
> the notebook; please clean the notebook up.

Fixed.

> Everything stored in the current working directory of the notebook should
> be uploaded automatically to S3 (telemetry-public-analysis-2/YOURJOBNAME/data)
> once your scheduled job completes; please remove the manual upload step once
> you are done with testing.

I believe this is fixed. I wasn't sure what the proper way was to create the .json file in the local directory. Would you mind reviewing this?

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> You might also want to try using Spark's accumulators [1] to make the code
> more elegant.
>
> [1] https://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka

Thank you for pointing me to accumulators. I didn't end up using them in this notebook, but I'm planning to use them in the more in-depth analysis of individual users who are out-of-date. I hope this is ok, but please let me know if you insist on accumulators being used here.
Flags: needinfo?(rvitillo)
(In reply to Stephen A Pohl [:spohl] from comment #5)
> Thank you for all the feedback, Roberto! New gist is up:
> https://gist.github.com/sapohl/722f9afe28297e2265bd
>
> Do you happen to know when we're planning to automatically generate weekly
> longitudinal datasets? And do you know what the format of the URL to these
> datasets will be? If not, would it make sense to file a bug and block on it
> before scheduling the notebook here as a weekly Spark job?

You can expect a new dataset to be made available every Monday [1]. In the future it will be possible to access the latest available dataset regardless of its creation date (Bug 1251756).

> I believe this is fixed. I wasn't sure what the proper way was to create the
> .json file in the local directory. Would you mind reviewing this?

It looks fine, have you tried scheduling it from a.t.m.o?

[1] https://mail.mozilla.org/pipermail/fhr-dev/2016-March/000852.html
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #6)
> (In reply to Stephen A Pohl [:spohl] from comment #5)
> > I believe this is fixed. I wasn't sure what the proper way was to create the
> > .json file in the local directory. Would you mind reviewing this?
>
> It looks fine, have you tried scheduling it from a.t.m.o?

I just tried. Unfortunately, the .json file does not appear to be uploaded to S3. I expected the file to be accessible at [1]. The log output [2] of the scheduled job seems to indicate that only the IPython notebook was uploaded, even though the .json file was created in the working directory. Did I miss something to get this .json file uploaded?

[1] https://analysis-output.telemetry.mozilla.org/update-orphaning-weekly-analysis/data/20160302.json
[2] https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/update-orphaning-weekly-analysis/logs/update-orphaning-weekly-analysis.20160302210926.log.gz
Flags: needinfo?(rvitillo)
My bad, you should save your results in the ./output directory. Have a look at the very end of [1].

[1] http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/Addon%20analysis/data/AddonAnalysis.ipynb
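A sketch of the fix being described: anything written under ./output is picked up and uploaded to telemetry-public-analysis-2/<job name>/data when the scheduled job finishes. The file name and payload below are illustrative placeholders, not what the notebook actually writes:

# Write the report under ./output so the scheduled job uploads it to S3.
import json
import os
from datetime import date

if not os.path.isdir("output"):
    os.mkdir("output")

report = {
    "date": date.today().strftime("%Y%m%d"),
    "out_of_date_pct": 0.0,  # placeholder; computed by the analysis above
}

out_path = os.path.join("output", "%s.json" % report["date"])
with open(out_path, "w") as f:
    json.dump(report, f)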
Flags: needinfo?(rvitillo)
Points: --- → 2
Priority: -- → P1
Thanks, Roberto! I've updated the notebook [1] with the following:
1. The output file is created in the ./output directory. I've confirmed that this uploads the .json file correctly.
2. The URL to the latest dataset is generated dynamically based on the current date (see the sketch below). This assumes that a new dataset is created weekly on Monday and has a filename with the date of each Monday.

[1] https://gist.github.com/sapohl/722f9afe28297e2265bd
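A sketch of point 2, under the stated assumption that a new longitudinal dataset appears every Monday and is named after that date. As scheduled, the job runs on Wednesdays, so the latest dataset is from two days earlier; the S3 path format is an assumption for illustration:

# Derive the latest dataset path from the current date.
from datetime import date, timedelta

monday = date.today() - timedelta(days=2)  # only correct when run on a Wednesday
dataset_path = "s3://telemetry-parquet/longitudinal/v%s" % monday.strftime("%Y%m%d")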
I scheduled this notebook to run weekly on Wednesdays as: Job 1111: update-orphaning-weekly-analysis
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You shouldn't hardcode the day of the week the job runs in case some data has to be backfilled.
My assumption was that this should occur rarely, and if it does, I would run the job manually to generate the .json file. Would this work?
We should automate as much as possible so that in case of failure it's easy to backfill data for, say, the past N weeks.
I've updated the notebook [1] so that it can run on any day of the week. It remains scheduled for Wednesday [2], but it can now be run as a one-off on any weekday without making changes to the code. This allows us to backfill data for the past week when needed.

[1] https://gist.github.com/sapohl/722f9afe28297e2265bd
[2] Job 1111: update-orphaning-weekly-analysis
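A sketch of the day-independent variant described here: locate the most recent Monday regardless of which weekday the job (or a manual backfill run) executes on. The path format remains an assumption for illustration:

# Find the most recent Monday and build the dataset path from it.
from datetime import date, timedelta

def latest_monday(today=None):
    today = today or date.today()
    # date.weekday() returns 0 for Monday, so this subtracts 0-6 days.
    return today - timedelta(days=today.weekday())

dataset_path = ("s3://telemetry-parquet/longitudinal/v%s"
                % latest_monday().strftime("%Y%m%d"))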
Great, thanks!
The dashboard that will use this data is expected to live on telemetry.mozilla.org [1]. Roberto, there's a pull request [2] to get this merged to t.m.o. Do I need to do anything else before this can be merged? Thank you!

[1] https://telemetry.mozilla.org/update-orphaning/
[2] https://github.com/mozilla/telemetry-dashboard/pull/226
Flags: needinfo?(rvitillo)
No, feel free to merge it once you have a r+ from chutten.
Flags: needinfo?(rvitillo)
Product: Cloud Services → Cloud Services Graveyard