Closed Bug 1664317 Opened 4 years ago Closed 4 years ago

Migrate BHR Collection Databricks job

Categories

(Data Platform and Tools :: General, task, P2)

Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benwu, Assigned: benwu)

References

Details

There's a BHR Collection job scheduled on Databricks. Is this still needed going forward? That is, should we migrate it to run somewhere else as we move off Databricks?

Flags: needinfo?(dothayer)

Yes, its output is still being used, and we're looking into new ways to utilize it. I can look into the details of migrating it, though I'll likely need a bit of assistance / recommendations on which alternative to use.

Flags: needinfo?(dothayer)

I would say the most straightforward way to migrate this is to convert it to a script, and data eng can handle scheduling via Airflow. I'll be doing this for some other jobs in bug 1664319, which covers something similar, so I can try to do them together. I'll let you know if I can get this working. (A rough sketch of what the Airflow scheduling could look like is below.)

Assignee: nobody → bewu
Points: --- → 2
Priority: -- → P2
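
A minimal sketch of how the converted script might be scheduled, assuming an Airflow 1.10-style DAG as telemetry-airflow used at the time. The DAG id, owner, schedule, and entry-point flags are illustrative assumptions, not the actual dags/bhr_collection.py:

# Hypothetical sketch only: names, schedule, and flags are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10 import path

default_args = {
    "owner": "example-owner",  # placeholder, not the real owner
    "depends_on_past": False,
    "start_date": datetime(2020, 11, 1),
    "email_on_failure": True,
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

with DAG(
    "bhr_collection_sketch",        # hypothetical DAG id
    default_args=default_args,
    schedule_interval="0 5 * * *",  # daily; time chosen arbitrarily
) as dag:
    # Run the converted mozetl script for the execution date.
    # The module path and --date flag are assumed, not verified.
    run_bhr_collection = BashOperator(
        task_id="run_bhr_collection",
        bash_command="python -m mozetl.bhr_collection --date {{ ds }}",
    )

In practice the real DAG runs the script on an ephemeral Dataproc cluster rather than on the Airflow workers themselves; see the scheduling code linked later in this bug.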

An example Databricks Jobs API request for a job that runs at 10:15pm each night:
{
  "name": "Nightly model training",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "timeout_seconds": 3600,
  "max_retries": 1,
  "schedule": {
    "quartz_cron_expression": "0 15 22 ? * *",
    "timezone_id": "America/Los_Angeles"
  },
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}

I've started migrating this over, but the job has been failing since last week because the fields in the BHR ping it was using were added to the schema and aren't in additional_properties anymore. I wanted to check one more time that this job is needed. I can put a bit of time in to see if I can fix this, but I want to confirm it's worth fixing.

Flags: needinfo?(dothayer)
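
For context, a minimal sketch of the kind of change the schema update forces, assuming a PySpark job reading the stable BHR ping table from BigQuery. The table and field names here are assumptions for illustration, not the job's actual code:

# Illustrative only: table and field names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stable ping tables land in BigQuery; the exact table name is assumed.
pings = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shared-prod.telemetry_stable.bhr_v4")
    .load()
)

# Before the schema change: the field only existed inside the free-form
# additional_properties JSON blob, so the job parsed it out of the string.
old_style = pings.select(
    F.get_json_object("additional_properties", "$.payload.hangs").alias("hangs")
)

# After the schema change: the field is a real structured column, and the
# JSON lookup above starts returning null, which is how the job broke.
new_style = pings.select(F.col("payload.hangs").alias("hangs"))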

(In reply to Ben Wu [:benwu] from comment #4)

> I've started migrating this over, but the job has been failing since last week because the fields in the BHR ping it was using were added to the schema and aren't in additional_properties anymore. I wanted to check one more time that this job is needed. I can put a bit of time in to see if I can fix this, but I want to confirm it's worth fixing.

Yes, it's still being used - it was on my TODO list to fix the Databricks job for this change. If that would be helpful to you for migrating this, I can prioritize it. Or if there's any other way I can help, please let me know!

Flags: needinfo?(dothayer)

It would be great if you could fix it this week; then I can get it running before the end of the year. Let me know if you can't get to it before the end of the week, and I can take a look. It doesn't seem too hard, but I'm not very familiar with the job.

(In reply to Ben Wu [:benwu] from comment #6)

> It would be great if you could fix it this week; then I can get it running before the end of the year. Let me know if you can't get to it before the end of the week, and I can take a look. It doesn't seem too hard, but I'm not very familiar with the job.

All right - it should be fixed now!

Thanks! I should be able to get it running on Dataproc by tomorrow.

The job ran successfully for yesterday and today, so I'm unscheduling it in Databricks. I'll keep an eye on the job.

Job code: https://github.com/mozilla/python_mozetl/tree/main/mozetl/bhr_collection
Scheduling code: https://github.com/mozilla/telemetry-airflow/blob/master/dags/bhr_collection.py
Job UI: https://workflow.telemetry.mozilla.org/tree?dag_id=bhr_collection

Changes to the job should be PRs into python_mozetl. There's ongoing work to find a lower-friction way to maintain scheduled jobs.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Hooray! Thanks Ben.

There were some failures on the first attempt of the day due to CPU quota limits. I'll change the machine type to one with a higher quota. (A sketch of the kind of change is below.)
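
A hedged sketch of one way such a machine-type change could look, assuming the Airflow 1.10 contrib Dataproc operator. The project id, zone, machine types, and worker count are all illustrative assumptions, not the actual DAG values:

# Illustrative only: all identifiers and sizes here are assumptions.
from airflow.contrib.operators.dataproc_operator import (
    DataprocClusterCreateOperator,  # Airflow 1.10 contrib import path
)

create_cluster = DataprocClusterCreateOperator(
    task_id="create_bhr_cluster",
    cluster_name="bhr-collection-{{ ds_nodash }}",
    project_id="example-team-project",  # hypothetical project id
    zone="us-west1-a",                  # hypothetical zone
    num_workers=5,
    # Switching to a machine family with more quota headroom (or to
    # fewer, larger workers) keeps capacity while staying under the
    # per-family CPU quota that caused the failures.
    master_machine_type="n2-highmem-8",
    worker_machine_type="n2-highmem-8",
)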

Hey Ben, in order to run this manually, what's the procedure for getting the values to provide for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY?

Flags: needinfo?(bewu)

My experience has been to file a bug in Data Platform and Tools :: Operations requesting keys with the right permissions; in this case, read/write access to the telemetry-public-analysis-2 bucket. It might also be worth considering using a GCS bucket instead, so that the output and Spark jobs can be managed together. (A sketch of how a manual run could consume the keys once granted is below.)
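
For reference, a minimal sketch of how a manual run might consume the credentials once granted. The bucket name is from this comment, while the prefix and the idea of a pre-run smoke test are assumptions:

# Illustrative only: the prefix and smoke-test approach are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<AWS_ACCESS_KEY_ID from the Operations bug>",
    aws_secret_access_key="<AWS_SECRET_ACCESS_KEY from the Operations bug>",
)

# Smoke-test that the keys can reach the output bucket before kicking
# off the full job; "bhr/" is a hypothetical output prefix.
resp = s3.list_objects_v2(
    Bucket="telemetry-public-analysis-2", Prefix="bhr/", MaxKeys=1
)
print(resp.get("KeyCount", 0))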

Jason, should a new AWS key pair be created for this (testing the Dataproc jobs)? And which project should the manual Dataproc jobs run in? I see a performance project, but a sandbox project could work too and would be better if we want to use GCS as well.

Flags: needinfo?(bewu) → needinfo?(jthomas)

(In reply to Ben Wu [:benwu] from comment #13)

> My experience has been to file a bug in Data Platform and Tools :: Operations requesting keys with the right permissions; in this case, read/write access to the telemetry-public-analysis-2 bucket. It might also be worth considering using a GCS bucket instead, so that the output and Spark jobs can be managed together.

Since this bucket is located in the Cloud Services Dev IAM account, I would recommend getting credentials to this AWS account via https://mana.mozilla.org/wiki/display/SVCOPS/Requesting+A+Dev+IAM+account+from+Cloud+Operations. This will allow you to provision a new bucket or reuse telemetry-public-analysis-2 for testing purposes.

The GCS bucket option might be okay for testing purposes. I believe the output of this job is consumed by a public dashboard, so if we want to completely switch this job to use GCS we will need to stand up something similar to analysis-output.telemetry.mozilla.org in GCP, deploy this as a public dataset, or use protodash.

> Jason, should a new AWS key pair be created for this (testing the Dataproc jobs)? And which project should the manual Dataproc jobs run in? I see a performance project, but a sandbox project could work too and would be better if we want to use GCS as well.

We should use the team projects. I believe it's only configured on some team projects right now, but I will enable it on the performance team project. [1]

[1] https://docs.telemetry.mozilla.org/tools/spark.html?highlight=dataproc#using-dataproc

Flags: needinfo?(jthomas)

> We should use the team projects. I believe it's only configured on some team projects right now, but I will enable it on the performance team project. [1]

This is done on the performance team project.
