Bug 1251398 (Closed): Opened 9 years ago, Closed 9 years ago

Write spark job to generate JSON data for update orphaning dashboard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: spohl, Assigned: spohl)

References

Details

To run our update orphaning dashboard (currently at http://people.mozilla.org/~spohl/update-orphaning-dashboard/) we're planning to create weekly reports on our orphaning state using longitudinal datasets. The output will be a JSON file stored on S3. Once longitudinal datasets are generated automatically every week, this could be scheduled as a weekly Spark job. For now, the intent is to run it once per week against the most recent longitudinal dataset. Note that currently, new datasets aren't created past 2016-02-12 due to bug 1246137.
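For context, a minimal sketch of what this kind of job might look like (this is not the actual notebook, which lives in the gist below). The parquet path, the column names, and the out-of-date check are assumptions made purely for illustration, and `sqlContext` is assumed to be predefined by the notebook environment:

# Sketch only: load one longitudinal snapshot and estimate the share of
# clients whose most recent ping reports an out-of-date application version.
# Path, columns, and predicate below are hypothetical.
dataset_path = "s3://telemetry-parquet/longitudinal/v20160212"  # hypothetical path
frame = sqlContext.read.parquet(dataset_path)

# In the longitudinal layout, per-ping fields are arrays ordered newest
# first, so element 0 is taken as the client's latest state (assumption).
rows = frame.select("client_id", "build").rdd

LATEST_VERSION = "45.0"  # placeholder for the current release version

def is_out_of_date(row):
    builds = row.build or []
    if not builds:
        return False
    return builds[0].application_version != LATEST_VERSION

out_of_date = rows.filter(is_out_of_date).count()
total = rows.count()
print("out-of-date: %d of %d clients (%.2f%%)"
      % (out_of_date, total, 100.0 * out_of_date / max(total, 1)))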
Roberto, would you be able to review the python notebook for this? Any feedback would be greatly appreciated! https://gist.github.com/sapohl/722f9afe28297e2265bd
Flags: needinfo?(rvitillo)
I forgot to mention that the update orphaning dashboard project has a mana page with a description of this project. I just finished updating it with the most recent changes (such as the fact that we'll be using the longitudinal dataset). The mana page is at: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=55312399

Also, I've made a few minor corrections to the python notebook and uploaded it as a new revision to the gist: https://gist.github.com/sapohl/722f9afe28297e2265bd (same URL as before)
Depends on: 1242039
The analysis looks OK, I have a few notes though:
- There are commented-out statements and import statements scattered around the notebook; please clean the notebook up.
- Everything stored in the current working directory of the notebook should be uploaded automatically to S3 (telemetry-public-analysis-2/YOURJOBNAME/data) once your scheduled job completes; please remove the manual upload step once you are done with testing.
Flags: needinfo?(rvitillo)
You might also want to try using Spark's accumulators [1] to make the code more elegant.

[1] https://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka
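A small sketch of the accumulator suggestion (not code from the notebook): tally several categories in a single pass over the data instead of running one filter().count() job per category. `sc` and `records` (an RDD of per-client dicts) are assumed to already exist, and the predicate is hypothetical:

# Count two categories in one pass using accumulators.
up_to_date = sc.accumulator(0)
out_of_date = sc.accumulator(0)

def tally(record):
    # Hypothetical predicate; the real notebook defines what counts as
    # up to date versus orphaned.
    if record.get("is_up_to_date"):
        up_to_date.add(1)
    else:
        out_of_date.add(1)

records.foreach(tally)
print(up_to_date.value, out_of_date.value)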
Thank you for all the feedback, Roberto! New gist is up: https://gist.github.com/sapohl/722f9afe28297e2265bd

Do you happen to know when we're planning to automatically generate weekly longitudinal datasets? And do you know what the format of the URL to these datasets will be? If not, would it make sense to file a bug and block on it before scheduling the notebook here as a weekly Spark job?

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> There are commented out statements and import statements scattered around
> the notebook; please clean the notebook up.

Fixed.

> Everything stored in the current working directory of the notebook should
> be uploaded automatically to S3 (telemetry-public-analysis-2/YOURJOBNAME/data)
> once your scheduled job completes; please remove the manual upload step once
> you are done with testing.

I believe this is fixed. I wasn't sure what the proper way was to create the .json file in the local directory. Would you mind reviewing this?

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #4)
> You might also want to try using Spark's accumulators [1] to make the code
> more elegant.
>
> [1] https://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka

Thank you for pointing me to accumulators. I didn't end up using them in this notebook, but I'm planning to use them in the more in-depth analysis of individual users who are out-of-date. I hope this is ok, but please let me know if you insist on accumulators being used here.
Flags: needinfo?(rvitillo)
(In reply to Stephen A Pohl [:spohl] from comment #5)
> Thank you for all the feedback, Roberto! New gist is up:
> https://gist.github.com/sapohl/722f9afe28297e2265bd
>
> Do you happen to know when we're planning to automatically generate weekly
> longitudinal datasets? And do you know what the format of the URL to these
> datasets will be? If not, would it make sense to file a bug and block on it
> before scheduling the notebook here as a weekly Spark job?

You can expect a new dataset to be made available every Monday [1]. In the future it will be possible to access the latest available dataset regardless of its creation date (Bug 1251756).

> I believe this is fixed. I wasn't sure what the proper way was to create the
> .json file in the local directory. Would you mind reviewing this?

It looks fine, have you tried scheduling it from a.t.m.o?

[1] https://mail.mozilla.org/pipermail/fhr-dev/2016-March/000852.html
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #6)
> (In reply to Stephen A Pohl [:spohl] from comment #5)
> > I believe this is fixed. I wasn't sure what the proper way was to create the
> > .json file in the local directory. Would you mind reviewing this?
>
> It looks fine, have you tried scheduling it from a.t.m.o?

I just tried. Unfortunately, the .json file does not appear to be uploaded to S3. I expected the file to be accessible at [1]. The log output [2] of the scheduled job seems to indicate that only the IPython notebook was uploaded, even though the .json file was created in the working directory. Did I miss something to get this .json file uploaded?

[1] https://analysis-output.telemetry.mozilla.org/update-orphaning-weekly-analysis/data/20160302.json
[2] https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/update-orphaning-weekly-analysis/logs/update-orphaning-weekly-analysis.20160302210926.log.gz
Flags: needinfo?(rvitillo)
My bad, you should save your results in the ./output directory. Have a look at the very end of [1].

[1] http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/Addon%20analysis/data/AddonAnalysis.ipynb
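A sketch of the fix being described: anything written under ./output is picked up and uploaded to telemetry-public-analysis-2/<job name>/data when the scheduled job finishes. The file name and payload below are illustrative placeholders, not what the notebook actually writes:

# Write the report under ./output so the scheduled job uploads it to S3.
import json
import os
from datetime import date

if not os.path.isdir("output"):
    os.mkdir("output")

report = {
    "date": date.today().strftime("%Y%m%d"),
    "out_of_date_pct": 0.0,  # placeholder; computed by the analysis above
}

out_path = os.path.join("output", "%s.json" % report["date"])
with open(out_path, "w") as f:
    json.dump(report, f)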
Flags: needinfo?(rvitillo)
Points: --- → 2
Priority: -- → P1
Thanks, Roberto! I've updated the notebook [1] with the following:
1. The output file is created in the ./output directory. I've confirmed that this uploads the .json file correctly.
2. The URL to the latest dataset is generated dynamically based on the current date (see the sketch below). This assumes that a new dataset is created weekly on Monday and has a filename with the date of each Monday.

[1] https://gist.github.com/sapohl/722f9afe28297e2265bd
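A sketch of point 2, under the stated assumption that a new longitudinal dataset appears every Monday and is named after that date. As scheduled, the job runs on Wednesdays, so the latest dataset is from two days earlier; the S3 path format is an assumption for illustration:

# Derive the latest dataset path from the current date.
from datetime import date, timedelta

monday = date.today() - timedelta(days=2)  # only correct when run on a Wednesday
dataset_path = "s3://telemetry-parquet/longitudinal/v%s" % monday.strftime("%Y%m%d")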
I scheduled this notebook to run weekly on Wednesdays as: Job 1111: update-orphaning-weekly-analysis
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You shouldn't hardcode the day of the week the job runs in case some data has to be backfilled.
My assumption was that this should occur rarely, and if it does, I would run the job manually to generate the .json file. Would this work?
We should automate as much as possible so that in case of failure it's easy to backfill data for, say, the past N weeks.
I've updated the notebook [1] so that it can run on any day of the week. It remains scheduled for Wednesday [2], but it can now be run as a one-off on any weekday without making changes to the code. This allows us to backfill data for the past week when needed.

[1] https://gist.github.com/sapohl/722f9afe28297e2265bd
[2] Job 1111: update-orphaning-weekly-analysis
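A sketch of the day-independent variant described here: locate the most recent Monday regardless of which weekday the job (or a manual backfill run) executes on. The path format remains an assumption for illustration:

# Find the most recent Monday and build the dataset path from it.
from datetime import date, timedelta

def latest_monday(today=None):
    today = today or date.today()
    # date.weekday() returns 0 for Monday, so this subtracts 0-6 days.
    return today - timedelta(days=today.weekday())

dataset_path = ("s3://telemetry-parquet/longitudinal/v%s"
                % latest_monday().strftime("%Y%m%d"))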
Great, thanks!
The dashboard that will use this data is expected to live on telemetry.mozilla.org [1]. Roberto, there's a pull request [2] to get this merged to t.m.o. Do I need to do anything else before this can be merged? Thank you!

[1] https://telemetry.mozilla.org/update-orphaning/
[2] https://github.com/mozilla/telemetry-dashboard/pull/226
Flags: needinfo?(rvitillo)
No, feel free to merge it once you have a r+ from chutten.
Flags: needinfo?(rvitillo)
Product: Cloud Services → Cloud Services Graveyard