Migrate BHR Collection Databricks job
Categories
(Data Platform and Tools :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: benwu, Assigned: benwu)
Details
There's a BHR Collection job scheduled on Databricks. Is this still needed going forward? i.e., should we migrate it to run somewhere else as we move off Databricks?
Comment 1•4 years ago
Yes, its output is still being used, and we're looking into new ways to utilize it. I can look into the details of migrating it, though I'll likely need a bit of assistance / recommendations on which alternative to use.
Assignee
Comment 2•4 years ago
I would say the most straightforward way to migrate this is to convert it to a script, and data eng can handle scheduling via Airflow. I'll be doing this for some other jobs in bug 1664319, which covers something similar, so I can try to do them together. I'll let you know once I get this working.
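For illustration, a minimal sketch of what the Airflow scheduling could look like; the operator choice, project, region, schedule, and paths below are placeholders, not the final DAG:

# Minimal sketch of an Airflow DAG for the converted script (not the final DAG).
# The project, bucket, cluster name, region, and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    "bhr_collection",
    schedule_interval="0 5 * * *",  # daily; the actual time is a placeholder
    start_date=datetime(2020, 11, 1),
    catchup=False,
) as dag:
    run_bhr_collection = DataprocSubmitJobOperator(
        task_id="bhr_collection",
        project_id="some-team-project",  # placeholder
        region="us-west1",  # placeholder
        job={
            "placement": {"cluster_name": "bhr-collection"},  # assumes a cluster exists
            "pyspark_job": {
                # script converted from the Databricks notebook (placeholder path)
                "main_python_file_uri": "gs://some-bucket/bhr_collection.py",
            },
        },
    )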
Comment 3•4 years ago
An example request for a job that runs at 10:15pm each night:
{
  "name": "Nightly model training",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "timeout_seconds": 3600,
  "max_retries": 1,
  "schedule": {
    "quartz_cron_expression": "0 15 22 ? * *",
    "timezone_id": "America/Los_Angeles"
  },
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}
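(For reference, a request body like the one above would be POSTed to the Databricks Jobs API to create the job; a sketch using requests, where the workspace host and token are placeholders:)

# Sketch: create the job above via the Databricks Jobs API 2.0.
# The workspace host and token are placeholders.
import json
import requests

DATABRICKS_HOST = "https://dbc-example.cloud.databricks.com"  # placeholder
TOKEN = "dapi-REDACTED"  # personal access token (placeholder)

with open("job.json") as f:  # the request body shown above
    job_spec = json.load(f)

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": 123}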
Assignee
Comment 4•4 years ago
I've started migrating this over, but the job has been failing since last week because the fields in the bhr ping it was using were added to the schema and aren't in additional_properties anymore. I wanted to check one more time that this job is needed. I can put a bit of time in to see if I can fix this, but I want to confirm it's worth fixing.
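To illustrate the failure mode (the column paths here are assumptions, not the job's real code): fields that used to ride along in the additional_properties JSON string become proper columns once they're added to the schema, so the old JSON lookup starts returning nulls:

# Sketch of the breakage; df is assumed to be a DataFrame of bhr pings,
# and the field paths are hypothetical.
from pyspark.sql import functions as F

# Old approach: pull unschema'd fields out of the additional_properties JSON string.
hangs_old = df.select(
    F.get_json_object("additional_properties", "$.payload.hangs").alias("hangs")
)

# Once the fields are in the schema they're real columns and no longer appear
# in additional_properties, so the lookup above yields null. New approach:
hangs_new = df.select(F.col("payload.hangs").alias("hangs"))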
Comment 5•4 years ago
(In reply to Ben Wu [:benwu] from comment #4)
> I've started migrating this over, but the job has been failing since last week because the fields in the bhr ping it was using were added to the schema and aren't in additional_properties anymore. I wanted to check one more time that this job is needed. I can put a bit of time in to see if I can fix this, but I want to confirm it's worth fixing.

Yes, it's still being used. It was on my TODO list to fix the Databricks job for this change; if that would be helpful to you for migrating this, I can prioritize it. Or if there's any other way I can help, please let me know!
Assignee
Comment 6•4 years ago
It would be great if you could fix it this week; then I can get it running before the end of the year. Let me know if you can't get to it before the end of the week, and I can take a look. It doesn't seem too hard, but I'm not very familiar with the job.
Comment 7•4 years ago
(In reply to Ben Wu [:benwu] from comment #6)
> It would be great if you could fix it this week; then I can get it running before the end of the year. Let me know if you can't get to it before the end of the week, and I can take a look. It doesn't seem too hard, but I'm not very familiar with the job.

All right - it should be fixed now!
Assignee
Comment 8•4 years ago
Thanks! I should be able to get it running in Dataproc by tomorrow.
Assignee
Comment 9•4 years ago
The job ran successfully for yesterday and today, so I'm unscheduling it in Databricks. I'll keep an eye on the job.
Job code: https://github.com/mozilla/python_mozetl/tree/main/mozetl/bhr_collection
Scheduling code: https://github.com/mozilla/telemetry-airflow/blob/master/dags/bhr_collection.py
Job UI: https://workflow.telemetry.mozilla.org/tree?dag_id=bhr_collection
Changes to the job should be PRs into python_mozetl. There's ongoing work to find a lower-friction way to maintain scheduled jobs.
Comment 10•4 years ago
Hooray! Thanks Ben.
Assignee
Comment 11•4 years ago
There were some failures on the first attempt of the day due to CPU quota limits. I'll change the machine type to one with a higher quota.
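For example (machine types and counts here are placeholders), in the Dataproc cluster config this is just a different machine_type_uri, e.g. switching to a family with more headroom in the project's CPU quota:

# Sketch: Dataproc cluster config using a machine family with more available
# CPU quota; the types and instance counts are placeholders.
cluster_config = {
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n2-standard-8",
    },
    "worker_config": {
        "num_instances": 5,
        # switched family to stay under the per-family CPU quota
        "machine_type_uri": "n2-highmem-8",
    },
}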
Comment 12•4 years ago
Hey Ben, in order to run this manually, what's the procedure for getting values to provide for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY?
Assignee
Comment 13•4 years ago
My experience has been to file a bug in Data Platform and Tools::Operations requesting keys with the right permissions. In this case, read/write access to the telemetry-public-analysis-2 bucket. It might also be worth considering using a GCS bucket instead so that the output and Spark jobs can be managed together.

Jason, should a new AWS key pair be created for this (testing the dataproc jobs)? And which project should the manual dataproc jobs run in? I see a performance project, but a sandbox project could work too and would be better if we want to use GCS as well.
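For the manual run itself, once keys are in hand they mostly just need to be wired into the Spark/Hadoop S3 settings; a sketch, assuming the hadoop-aws (s3a) connector is on the classpath:

# Sketch: passing AWS credentials to a manual Spark run so it can write to
# the telemetry-public-analysis-2 bucket. Assumes the hadoop-aws (s3a)
# connector is available; the output path is a placeholder.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bhr-collection-manual")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# ...run the job, then write output, e.g.:
# df.write.mode("overwrite").json("s3a://telemetry-public-analysis-2/bhr/...")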
Comment 14•4 years ago
(In reply to Ben Wu [:benwu] from comment #13)
> My experience has been to file a bug in Data Platform and Tools::Operations requesting keys with the right permissions. In this case, read/write access to the telemetry-public-analysis-2 bucket. It might also be worth considering using a GCS bucket instead so that the output and Spark jobs can be managed together.

Since this bucket is located in the Cloud Services Dev IAM, I would recommend getting credentials for this AWS account via https://mana.mozilla.org/wiki/display/SVCOPS/Requesting+A+Dev+IAM+account+from+Cloud+Operations. This will allow you to provision a new bucket or reuse telemetry-public-analysis-2 for testing purposes.

The GCS bucket option might be okay for testing purposes. I believe the output of this job is consumed by a public dashboard, so if we want to completely switch this job to use GCS we will need to stand up something similar to analysis-output.telemetry.mozilla.org in GCP, deploy this dataset as a public dataset, or use protodash.

> Jason, should a new AWS key pair be created for this (testing the dataproc jobs)? And which project should the manual dataproc jobs run in? I see a performance project, but a sandbox project could work too and would be better if we want to use GCS as well.

We should use the team projects. I believe it's only configured on some team projects right now, but I will enable it on the performance team project. [1]
[1] https://docs.telemetry.mozilla.org/tools/spark.html?highlight=dataproc#using-dataproc
Comment 15•4 years ago
> We should use the team projects. I believe it's only configured on some team projects right now, but I will enable it on the performance team project. [1]
This is done on the performance team project.