Closed Bug 1369149 Opened 8 years ago Closed 5 years ago

Develop Deploy Mechanism for Dataset Creation Code

Categories

(Data Platform and Tools :: General, enhancement, P3)

x86
macOS
Points: 3

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: frank, Assigned: amiyaguchi)

Details

For telemetry-batch-view and python_mozetl, we have one deploy method: merge to master. The Airflow scripts pull down those repos, build the projects (where applicable), and launch the Spark jobs. Obviously this has some downsides: we can't have multiple versions of these dataset creation scripts, there is no separation between stage and prod, and we can't test out new releases. We should integrate our Airflow scripts with tagged telemetry-batch-view and python_mozetl releases. This could involve building JARs and deploying them to S3 alongside git releases, then pulling those JARs down in the Airflow jobs; see the sketch below. There are probably other, easier, or better approaches we can investigate. python_mozetl could simply be published to PyPI, like our other Python projects.
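To make that concrete, here is a minimal sketch of what the JAR approach could look like from the Airflow side. This is not the actual job definition: the bucket, key layout, version, and `--date` flag are hypothetical, and the class name is just an example.

```python
# A minimal sketch (assumptions noted above) of running a pinned
# telemetry-batch-view release from S3 instead of building master at run time.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

TBV_VERSION = "1.2.3"  # hypothetical tagged release
TBV_JAR = (
    "s3://example-artifacts/telemetry-batch-view/"
    f"{TBV_VERSION}/telemetry-batch-view.jar"
)

dag = DAG(
    "main_summary",
    start_date=datetime(2017, 6, 1),
    schedule_interval="@daily",
)

main_summary = BashOperator(
    task_id="main_summary",
    # Fetch the pinned JAR and hand it to spark-submit; pointing stage and
    # prod at different TBV_VERSION values separates the two environments.
    bash_command=(
        f"aws s3 cp {TBV_JAR} /tmp/telemetry-batch-view.jar && "
        "spark-submit --class com.mozilla.telemetry.views.MainSummaryView "
        "/tmp/telemetry-batch-view.jar --date {{ ds_nodash }}"
    ),
    dag=dag,
)
```

Rolling back a bad release then becomes a one-line change to the pinned version rather than a revert on master.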
Assignee: nobody → amiyaguchi
Points: --- → 3
Priority: -- → P3
Bug 1385232 added a common entry point for mozetl jobs. The repository should still become a proper package with tagged releases, but only after the common submission script is adopted more widely: currently 10 of 20 jobs in mozetl use `mozetl-submit.sh` with the Airflow wrapper.

The submission script has an option for using alternate git paths and branches, which was primarily for my development workflow. On my local instance of Airflow, I can edit the environment to read the package from my PR branch; I used this to pin versions of a dataset creation script in bug 1404502.

The Click environment convention is fairly powerful, and it could unify the two repositories under a single command-line API (a sketch follows below). A wrapper script like [1] could abstract away the details of building and submitting jobs from Airflow.

[1] https://github.com/acmiyaguchi/telemetry-airflow/blob/c75f08a260c956181801de3ddbffcdcdfe18b5d6/jobs/retention.sh
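For reference, a minimal sketch of the Click environment convention; the command and option names here are hypothetical, not mozetl's actual jobs. With `auto_envvar_prefix`, Click resolves any option not given on the command line from a `MOZETL_<COMMAND>_<OPTION>` environment variable, so Airflow only needs to set the environment before invoking the entry point.

```python
# Hypothetical unified entry point; job and option names are illustrative.
import click


@click.group()
def cli():
    """Unified command-line API for dataset creation jobs."""


@cli.command()
@click.option("--date", required=True, help="Submission date to process.")
@click.option("--bucket", default="example-output", help="Output S3 bucket.")
def example_job(date, bucket):
    # A real job would kick off its ETL here; this just echoes the config.
    click.echo(f"running example_job for {date}, writing to {bucket}")


if __name__ == "__main__":
    # e.g. MOZETL_EXAMPLE_JOB_DATE=20171101 python cli.py example-job
    cli(auto_envvar_prefix="MOZETL")
```

This is what makes the Airflow wrapper thin: it only has to export environment variables and call one script, regardless of which job is being run.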
Whiteboard: [SvcOps] → [DataOps]

We now do the equivalent of this in GCP through the Docker deployments of bigquery-etl.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Datasets: General → General